On 21 June 2002, the Forced Migration Online (FMO) Digital Library was launched by the Refugee Studies Centre (RSC), Queen Elizabeth House, University of Oxford. The FMO Digital Library has made available an initial tranche of c. 3000 documents of unpublished (grey) literature on forced migration and refugee issues, with items ranging from one to 700 pages in length and a total of around 70,000 pages. The documents cover the last 40 years, and most of the material derives from the unique collection of some 15,000 items of grey literature available at the RSC library in paper form. The digital collection covers most regions of the world, and most topics that make up the diverse subject area of refugee studies. Also available in the digital library is a smaller collection from the Feinstein International Famine Center at Tufts University, which deals mostly with famine and nutritional issues. After carrying out a feasibility study and a number of pilots, we chose Olive Software’s ActivePaper Archive to deliver the digital library, and it has already proved an excellent tool for our purposes.
The Document Collection
The materials in the collection derive from a wide variety of sources, and are in different formats and conditions. Grey literature is a crucial source of information in subjects like refugee studies which are relatively new, interdisciplinary by their very nature, and concerned with practice in humanitarian assistance as well as with academic study and reflection. Twenty years ago, when the RSC was established in Oxford, there were very few books and journals that dealt with this subject area, and much valuable data was only accessible in unpublished sources: reports from intergovernmental agencies such as UNHCR, or from NGOs, and in conference papers and a range of other personal communications written by individuals. The RSC therefore began to collect this literature, catalogue it, and make it available to a wide range of scholars, students and practitioners—most of whom travelled to Oxford from all over the world. An online catalogue was produced in 1995. Realising the value of the collection, in 1996 the Andrew W Mellon Foundation and the European Union granted substantial sums of money for digitization to make it more widely available. The Digital Library project began in 1997.
Production of the Digital Library
The greatest problem with such a modern collection is of course that every item is in copyright. One of the first tasks to be embarked upon therefore was the investigation of the copyright issues, and the establishment of a workflow process to clear copyright and produce a full audit trail of the copyright clearance activities for future queries. This involved a mixture of legal consultation, technical development and sheer hard work. Mostly, individuals and organisations were very willing to allow us to use their materials, but given the sensitivity of the subject area, there were some items that we were unable to make available. The two biggest problems with the copyright process were the time everything took, and the difficulty in actually locating the copyright holders. Individuals in particular were hard to track down, as people working in this field are often highly mobile: humanitarian assistance requires aid workers to go wherever the current crises may be. For some, there may have been a period of more than 10 years from when they donated a document to our collection to the point where we were seeking permission to scan, and they might have moved three or four times. Some authors never responded at all, and so we have proceeded cautiously with making their materials available. In UK law, having made every effort to secure the rights, we are able to put the documents online with suitable disclaimers about ownership. In order to ease the workflow process, and to enable us to record all the transactions in detail, we produced our own copyright clearance database in MS Access. This has been working smoothly for more than three years, and a number of other projects are now also using it.
Because the copyright process took much longer than we had envisaged, it held up the scanning process considerably. It also meant that we incurred higher costs: we did not actually pay for any permissions, given that the materials are unpublished, but the costs in staff time, printing, postage etc. were considerable. Now that we have good systems, trained staff and a smooth-running workflow process, ongoing work on copyright will (hopefully) be easier. A fuller account of the copyright process can be found in RLG’s Diginews at http://webdoc.gwdg.de/edoc/aw/rlgdn/preserv/diginews/diginews4-5.html#feature1.
The digitization process
The RSC collection is highly diverse, and consists of pamphlets, newsletters, theses, reports, faxes, letters, and other kinds of unpublished documents. Our first task in deciding how to scan this and make it available was to survey the materials and see what issues and problems might arise. A feasibility study was therefore carried out by the UK Higher Education Digitisation Service (HEDS) to assay the formats, typestyles, bindings, colour/greyscale content and condition of the collection. Some 800 documents (selected randomly but at regular intervals from the collection to provide a representative sample) were inspected and a range of features recorded. From these documents, a subset was chosen for sample scanning. The subset represented the full range of conditions (excellent, good, fair, poor), and a range of features: text only, greyscale content, colour content, difficult type styles. Sample pages were scanned at 300, 400, and 600dpi, and at 1, 8, and 24 bits. These samples were examined carefully and compared with the originals so that benchmarks could be established. From this process we derived our standard scanning parameters: the majority of our documents turned out to be black & white typescript in good condition, and are scanned at 300dpi bitonal. If there is greyscale or colour content, this is scanned at 8 bits. Note that we are not aiming for fidelity to the original, but at accessibility of the content. We decided that for the content we have and the audience we are serving, little is added by presenting the colour content (which is very limited) in full colour, and much is lost in terms of performance, given the file sizes that 24 bit scanning creates. Documents in poorer condition are scanned at 400dpi. We have never found it necessary to scan at a higher resolution: the resolutions chosen give acceptable readability of even the smallest significant features, produce excellent OCR output, and also give good quality printed output.
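The trade-off between bit depth and file size can be made concrete with a little arithmetic. The following Python sketch estimates the uncompressed size of one scanned page at the resolutions and bit depths tested. The A4 page size is an assumption made purely for illustration (the collection contains many formats), and real files will be smaller once compressed.

```python
# Rough, uncompressed size of one scanned page image.
# Assumption: an A4 page (8.27 x 11.69 inches); the real collection varies.
PAGE_WIDTH_IN = 8.27
PAGE_HEIGHT_IN = 11.69


def uncompressed_bytes(dpi: int, bits_per_pixel: int) -> int:
    """Return the raw image size in bytes for one page."""
    width_px = round(PAGE_WIDTH_IN * dpi)
    height_px = round(PAGE_HEIGHT_IN * dpi)
    total_bits = width_px * height_px * bits_per_pixel
    return (total_bits + 7) // 8  # round up to whole bytes


if __name__ == "__main__":
    # Print a small table for the sample resolutions and bit depths
    for dpi in (300, 400, 600):
        for bits in (1, 8, 24):
            size_mb = uncompressed_bytes(dpi, bits) / 1_000_000
            print(f"{dpi} dpi, {bits:>2}-bit: {size_mb:7.1f} MB")
```

Before compression, a 24-bit image is three times the size of an 8-bit image at the same resolution, and roughly 24 times the size of a bitonal one, which is why full colour carries a real cost for users on slow connections.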
All scanning is outsourced to HEDS, who also carry out the QA process. Document preparation takes a significant time, given that detailed inventories of every page sent need to be produced. After scanning, the documents are reintegrated into the paper collection. Documents are disbound for scanning, then stapled together for return to the library. There is not felt to be significant artefactual value in the documents to justify the extra cost of non-destructive scanning. Images are captured as single TIFFs with Group 4 compression and provided to us on CD ROMs.
Choosing the delivery system
In 1997, when the development of the digital library began, there was no obvious candidate as a delivery system for the digital library: our hope was that one would come along as we went through the processes described above. When the feasibility study was complete, we decided to carry out a pilot project on some 200 documents. This was designed to help us establish our workflow processes in the selection, scanning and delivery of the documents, and also to test some candidate delivery systems. One huge advantage that the project started with was an online catalogue to the complete grey literature collection, produced over twenty years in the Cardbox cataloguing system. This has been available on the web since 1995 (see below for further details of this). Our main requirements for a delivery system were:
- It should be able to handle bibliographic records
- It should be able to perform full-text searching
- It should be browser-based with no plug-ins (if possible)
- It should be fast over unreliable links, given that we are delivering to the developing world
- It should use uncorrected OCR for indexing, and deliver the page image for reading and printing
- It should adhere to international standards
These requirements were a tall order back in 1997!
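The fifth requirement, searching uncorrected OCR while showing the reader the page image, can be illustrated with a minimal sketch. The Python below builds a small inverted index from OCR text to page-image file names. The file names, sample text and function names are all hypothetical, and the sketch does not describe how any of the systems discussed here is actually implemented.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> list[str]:
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages: dict[str, str]) -> dict[str, set[str]]:
    """Map each word in the uncorrected OCR text to the page images containing it."""
    index: dict[str, set[str]] = defaultdict(set)
    for image_name, ocr_text in pages.items():
        for word in tokenize(ocr_text):
            index[word].add(image_name)
    return index


def search(index: dict[str, set[str]], query: str) -> list[str]:
    """Return the page images whose OCR text contains every query word."""
    words = tokenize(query)
    if not words:
        return []
    hits = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(hits)


# Hypothetical OCR output keyed by page-image file name.
# "rcfugee" stands in for a typical uncorrected OCR error.
pages = {
    "doc0001_p001.tif": "Report on refugee camps in the region",
    "doc0001_p002.tif": "The rcfugee population and food supply",
    "doc0002_p001.tif": "Food supply and nutrition in camps",
}
index = build_index(pages)
print(search(index, "food supply"))
print(search(index, "refugee"))
```

In this toy example the misrecognised word on the second page means that page is missed by a search for "refugee", but because the reader is shown the page image rather than the OCR text, recognition errors affect only what can be found, not what is read or printed.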
Five systems were tested in all, two of which are still available for searching; only one of the five answered all the above requirements. The systems were:
1. OpenText (with the Bodleian Library)
The RSC library catalogue is delivered on the web using the OpenText retrieval system. Using techniques developed for the Internet Library of Early Journals (ILEJ) project, Richard Gartner of the Bodleian Library took our 200 pilot documents and attached them to records downloaded from the catalogue, with full text searching capability added. This is still available at http://rsc.qeh.ox.ac.uk/rsccat. OpenText offered good bibliographic support as well as full text searching, and it also allowed us to deliver page images. However, OpenText has not been supported by the company that produced it for some years now, and this was a major factor in our decision not to pursue it as a route for our development. The system has now been adopted by the University of Michigan and is distributed by them as the DLXS system, but this development happened after our choice of Olive's ActivePaper Archive as our delivery system.
2. DBTextworks (with the Institute for Development Studies, Sussex)
The Institute for Development Studies uses the DBTextworks library software for delivering its ELDIS gateway to online information on development and environment. This has excellent bibliographic capabilities, and also allows fields with full text or images. They also took our 200 pilot documents and made them available at http://nt1.ids.ac.uk/eldis/rsp/rspsea.htm. The images are no longer available on this system as we have since moved them to another server, but the records can be searched. Bibliographic handling on this system met our standards, but it performed less well with text searching and image handling.
3. Muscat (with De Montfort University)
Muscat is a powerful text search system for full-text databases. It works on relatively poor OCR, and gives relevance rankings for searches, as well as opportunities to improve searches: the search engine automatically suggests words related to the enquiry and generates a more focused search. The system displayed excellent images, but they proved slow to download. It was also weak in bibliographic support.
4. Reachcast (with Tel Aviv University)
Reachcast was the precursor to ActivePaper Archive and was produced by IOTA software. It was developed for searching newspaper text, and gave excellent results for our text searching requirements, with the added value that hits were highlighted on the page—the first system we had seen that could offer this. However, the bibliographic support was poor, and it required users to download a proprietary browser, which proved very slow, especially over slow dial-up connections.
5. Olive Software's ActivePaper Archive
The pilots described above were completed by the end of 1999, and none seemed quite suitable for our purposes. Then at the end of 2000, we became aware of ActivePaper Archive (APA), which seemed to offer most of the features specified above: it has good bibliographic support, and also supports Dublin Core; it is XML-based so adheres to international standards; it has excellent OCR capability on even relatively compromised originals; it uses standard browser facilities and so doesn’t require plug-ins (though for better printing, there is the possibility to download an Acrobat file); and it highlights the hits on the page. It is also fast, scalable, and performs as well with huge document collections as it does with small ones. We therefore chose it as our delivery system.
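For readers unfamiliar with Dublin Core, the sketch below shows roughly what a simple Dublin Core record expressed in XML might look like, built here with Python's standard library. The field values are placeholders, and the wrapping `record` element is an assumption made for illustration; this is not ActivePaper Archive's actual schema.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def make_record(fields: dict[str, str]) -> ET.Element:
    """Build a <record> element holding one dc:* child per field."""
    record = ET.Element("record")  # hypothetical wrapper element
    for name, value in fields.items():
        child = ET.SubElement(record, f"{{{DC_NS}}}{name}")
        child.text = value
    return record


# Placeholder values only; not a real catalogue entry
record = make_record({
    "title": "Example report title",
    "creator": "Example Organisation",
    "date": "1995",
    "type": "Text",
    "language": "en",
})
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

Because the element names come from a published standard, a record of this kind can in principle be exchanged with other systems that understand Dublin Core.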
Developing the delivery system
ActivePaper Archive was developed primarily for the digitization of historic newspaper collections, and the FMO team worked closely with Olive and OCLC on the British Library Newspaper Pilot (see www.uk.olivesoftware.com). We therefore understood the potential of APA, but we were asking for something rather different. We wanted a system that would present our grey literature in a simple interface, with our catalogue records, and would also allow powerful and flexible searching. Based upon our pilots, we produced a detailed functional specification for Olive to work to in the production of the system. This was the first time that APA was to be used for collections other than newspapers, and so the Olive technical team worked closely with the FMO team in Oxford, and also with the technical partners at the Centre for Computing in the Humanities (CCH) at King's College London. OCLC had some technical input into the process, too. The development phase took around nine months, with a preliminary release after three months to allow for testing and then two more releases before the final launch of the system in June 2002.
We are delighted with the system that we have produced—all the hard work has been worthwhile. Our users are also very pleased, and feedback is positive and constructive. We want to add content rapidly to the digital library, and also extend the range of materials we will have available: we are working with Olive at the moment on the addition of full back runs of key journals in the field of forced migration to the database. Forced Migration Online will eventually be a portal for the study of forced migration, with many more components: the full portal will be launched in November 2002 and will provide instant access to a wide variety of online resources concerning the situation of forced migrants world-wide. Designed for use by practitioners, researchers, policy makers, students or anyone interested in the field, FMO aims to give comprehensive information in an impartial environment and to promote increased awareness of human displacement issues to an international community of users. In order to achieve this, we work with a number of international partners, including the Feinstein International Famine Center and the Fletcher School of Law and Diplomacy at Tufts University; the Program on Migration and Public Health, Columbia University, New York; The Czech Helsinki Committee, Prague; and the American University, Cairo. With the help of Olive and OCLC, we will create federated digital collections from collections held by these partners and others, and make them available free for world-wide use.