BSD News • October 11, 2018
In the far north of Norway, near the Arctic Circle, experts at the National Library of Norway’s (NLN) secure storage facility are in the process of implementing an astonishing plan.
They aim to digitize everything ever published in Norway: books, newspapers, manuscripts, posters, photos, movies, broadcasts, and maps, as well as all websites on the Norwegian .no domain.
Their work has been going on for the past 12 years and, by current estimates, will take 30 years to complete.
At the moment, the library has more than 540,000 books and over 2,000,000 newspapers in its archive. These have been mass-scanned and OCR-processed before being stored, so all the content in the library is free-text searchable.
As of early September, the collection amounted to 8.1 petabytes of data and was growing by 5 to 10 terabytes every day, Svein Arne Solbakk, department director for digital library development at the NLN, tells ZDNet.
The NLN’s mandate isn’t just long-term safe storage. It must also make its archives available to the public, so it needs online storage for publishing the collection.
“Just to be able to handle the large amounts of data, we must have it online. If I get a PDF file from a newspaper, I know this format won’t last for a thousand years. I’ll have to convert it to a modern format, probably several times during those thousand years,” Solbakk says.
He illustrates this point by explaining that they’ve already had to complete their first large-scale format conversion, involving 50 million image files. This process took 10 servers three months of 24/7 processing to complete, even though the files were stored on hard disks.
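To put those conversion figures in perspective, the implied throughput can be worked out with a few lines of arithmetic. This is only a rough sketch: it assumes "three months" means 90 days of uninterrupted processing and that the work was spread evenly across the 10 servers, neither of which the NLN has stated.

```python
# Figures quoted in the article.
FILES = 50_000_000   # image files converted
SERVERS = 10         # servers used for the conversion

# Assumption (not from the article): "three months" is taken as 90 days.
DAYS = 90


def conversion_rates(files, servers, days):
    """Return (aggregate files/s, per-server files/s, seconds per file per server)."""
    seconds = days * 24 * 60 * 60
    aggregate = files / seconds          # all servers combined
    per_server = aggregate / servers     # assumes an even split of the work
    return aggregate, per_server, 1 / per_server


aggregate, per_server, seconds_per_file = conversion_rates(FILES, SERVERS, DAYS)
print(f"Aggregate rate:   {aggregate:.2f} files/s")
print(f"Per-server rate:  {per_server:.3f} files/s")
print(f"Time per file:    {seconds_per_file:.2f} s on each server")
```

Under these assumptions each server spent roughly a second and a half on every image, which gives a sense of why the same job would be far slower if the files first had to be recalled from tape.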
Furthermore, given the relatively short life of hard disks, the NLN’s approach is to have a rolling program of disk replacement, swapping out entire disk cabinets when they reach their expected lifespan of five years.
In addition, the NLN stores everything in triplicate. One copy is on hard disk, with two more copies on tape. The tape storage is an archive system based on Oracle SAM-FS, so it’s not a traditional tape backup system.
“When we’re talking petabytes, we can’t talk about backup. A petabyte restore from tape would take weeks,” Solbakk says. Thus, the NLN’s system is more of a storage-virtualization approach that is currently handling more than 24 petabytes in total.
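Solbakk’s "weeks" estimate is easy to sanity-check. The sketch below is purely illustrative: the sustained drive speed of 250 MB/s and the drive counts are hypothetical values chosen for the example, not figures from the NLN or from Oracle, and it ignores tape mounting, seeking, and verification time.

```python
PETABYTE = 10**15  # bytes, using decimal (SI) units


def restore_days(data_bytes, drive_mb_per_s, drives):
    """Days needed to stream data_bytes from tape at a sustained rate.

    drive_mb_per_s and drives are hypothetical inputs for illustration.
    """
    bytes_per_second = drive_mb_per_s * 1_000_000 * drives
    return data_bytes / bytes_per_second / 86_400  # 86,400 seconds per day


# Assumed sustained throughput per tape drive (hypothetical).
DRIVE_MB_PER_S = 250

for drives in (1, 2, 4, 8):
    days = restore_days(PETABYTE, DRIVE_MB_PER_S, drives)
    print(f"{drives} drive(s): {days:.1f} days to restore 1 PB")
```

Even with several drives streaming in parallel at the assumed speed, a single petabyte takes on the order of one to several weeks, which is consistent with Solbakk’s point that restore-from-backup is not a workable model at this scale.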
Some 83 percent of all books and 40 percent of all newspaper pages have been digitized. In addition, among several other projects, the NLN is currently scanning 100,000 radio broadcast tapes before the tape players needed for the job disappear for good. It’s easy to be impressed by the NLN’s ambition.
“We are ambitious, but it’s very important to document the present for the future,” Solbakk concludes.
NATIONAL LIBRARY OF NORWAY’S DIGITAL COLLECTION, SEPTEMBER 2018
The tape storage in Norway based on Oracle SAM-FS