Bio: James Snyder is a digital media engineering, production & project management specialist. His extensive experience includes television, film, radio, internet technologies and covers the gamut from traditional analog to cutting edge digital data, audio and video technologies. His career in both commercial and non-commercial sectors spans over 30 years. Mr. Snyder currently serves as the Senior Systems Administrator for the Library of Congress' National Audio-Visual Conservation Center located on the Packard Campus for Audio Visual Conservation in Culpeper, Virginia. He is responsible for all the audio, video and film preservation and digitization technologies, including long-term planning, technology services to the United States Congress and Capitol Hill, as well as standards participation and interaction with media content producers.
Q. What is your formal title, and how did you arrive at the job of overseeing the technical aspects of archiving for the Audio-Visual Center?
Snyder. My title is senior systems administrator for the National Audio-Visual Conservation Center. I got the job because I was one of the two design engineers that helped to design the technical plant of the building, and the folks at the library realized that part of the reason why I was such a good fit as one of the design engineers was because I also knew a lot about not only history in general but also the history of media and technology, which has actually been part of my hobbies over the years, and so it just turned out to be an advantageous fit.
My career has been quite varied. I started off at the proverbial small public television station in east central Indiana back in 1980. In fact February 15th will be my thirtieth anniversary in the business so to speak. And for the first half of my career, I did mainly radio and television production including pretty much every position you can think of. I've been a cameraman and an editor and a sound recordist. I have run film. I've shot film. I have done live broadcasts of symphonies over NPR. I've done a whole bunch of different things over the years.
Q. Let's talk about the formation of the Audio-Visual Conservation Center. How did it come about? Who were some of the key individuals that also helped support the construction of it, and what was the vision of this facility?
Snyder. There were a couple of different things all going on at once. In the standards arena, there was a standard being created back in the late 1990s called the JPEG 2000 standard, and I'll get back to why that's important in a little bit. The Library of Congress realized that to be able to really access and preserve the collection well, they would have to concentrate the collection in one building, because the library has roughly 142 million items in its collections, of which roughly six and a quarter million are in our division, the Motion Picture, Broadcasting and Recorded Sound Division. And essentially we're the non-print media division of the library, if you will. And then, you know, there are prints and photographs. There are periodicals. There are books, and there are other sections of the library. But, in essence, we're predominantly the non-print media section.
And so to be able to access any of our content, you first of all had to find it amongst eight different federal storage facilities spread around the Eastern United States. The folks at the top of the library, including the Librarian of Congress (because every library does indeed have a librarian, and our librarian is Dr. James Billington), realized that they needed to bring the collections together and that at some point they needed to be able to migrate them to whatever the next format would be. Audio and videotape especially are much more ephemeral than motion picture film is. That is, they don't last anywhere near as long as motion picture film does. And so they knew that they were – even back in the late 1990s – rushing against a clock to start moving these older audio and videotapes before they would decay into something unusable.
Now, at this point, one of the participants in one of the library's planning committees was a gentleman by the name of David Packard who is the son of the Packard in Hewlett-Packard, and because he was privy to the Packard fortune, he inherited a great deal of money. He had been a supporter of motion picture restoration and motion picture theater restoration. He had a great interest in motion picture film and making sure that the old motion pictures, especially the old black and white motion pictures before 1950, were preserved in a manner that would allow them to be seen by future generations. He had a great deal of interest in that.
One of the sad facts of the motion picture industry is as you go back in time fewer and fewer of the motion pictures produced still exist. From the silent era, supposedly only about 5% of the films that are pre-1930 still exist, which is really sad if you think about it.
And so Mr. Packard was definitely very motivated to help out, and so what they did was they made an arrangement where Mr. Packard agreed to donate the funds to create what is now the building I am sitting in, which has been dubbed the Packard Campus for Audio-Visual Conservation. And what they did is they took the old Federal Reserve Bank of Richmond nuclear bunker, which is sitting on a hill outside of Culpeper, Virginia, which is about 70 air miles southwest of Washington, and they said, “We're going to renovate the nuclear bunker which used to hold billions of dollars in currency so that if the nuclear bomb ever went off, the survivors would have fresh, non-irradiated currency.” And once they realized that that probably wasn't the best idea in the world, they cleared out all the currency, and they gave the building to the Packard Foundation with the understanding that the Packard Foundation would give it back with a lot more building attached at a later point in time.
And so the Packard Foundation, the Packard Humanities Institute paid for, designed, and built this building of which about a third of it is the Old Federal Reserve Bank Building underground but renovated, another third of it is the conservation building, the building I actually sit in and where all the technical equipment is, and that building is where all the work actually goes on. That's aboveground. That's a horseshoe-shaped building that you can actually see from U.S. Route 29 and downtown Culpeper. And then they built another underground wing which was specifically for the storage of nitrate film.
And for those who don't know, nitrate film was the media, the actual physical media that the light-sensitive emulsion was put on before 1951, and it was originally made out of a solution of guncotton and nitroglycerin, amongst other things, and one of the problems is as it ages, the nitroglycerin and the guncotton tend to come and express themselves again. So nitrate is known for when it starts to decay it produces its own oxygen, and as it decays, it produces heat, and so it can spontaneously combust. And once it spontaneously combusts, once it bursts into flame, you cannot put the fire out because it generates its own oxygen.
So nitrate film is considered very explosive, and so we have a specially-designed wing of the building that does nothing but store nitrate film in cold vaults that are steel-reinforced concrete walls with blast chimneys above them, so just in case any of the film actually does go up, it doesn't take the rest of the collection with it. And so we have one entire wing specifically for preserving the millions of feet of nitrate film that we have in our collection.
The Old Federal Reserve Bank Building is the other preservation wing, and we call that Phase 1 because that was the building that existed originally, and that takes the bulk of the rest of the collection, in other words, all the safety film, all of the videotape, audiotape, disks, Edison wax cylinders, you know, all the various other things that are in the collection. And we have examples of just about every recorded technology, whether audio, video, or film, going back to the beginning of the very ability to record.
Now, before the center was created that was the way the library got around something like that. If 20 years ago a two-inch videotape was showing signs of decay and they thought that if they waited any longer they would actually lose the content, they would dub it to, like, a Betacam SP. We don't do that anymore. We don't do anything to analog unless someone specifically asks us to. And the only people who ask us to do that are the copyright content owners. If they want something back in a particular format, they'll ask us for a particular format. We don't archive anything to analog; we do everything in digital.
Q. What are some of the challenges in preserving media?
Snyder. The problem is all media is ephemeral. All media decays over time. A quick way to tell how in danger something potentially is, is by how complex is the device that it's recorded on and how is the particular storage medium -- how does it age. And so what we've noticed is things like hard drives, because they are very complex devices, some of them last a very long time. Some of them are dead out of the box. Some of them will last for a month. Some of them will last for a year. Some of them if you drop them a certain way will have a head crash and they'll never work again and will cost you a couple thousand bucks to get the data back if you can afford it – so the more complex the device, the more in danger it potentially is.
For things like optical disks, if they are not stored in absolutely perfect condition, which basically means, you know, something akin to how we store things here, both dye-based and pressed optical disks have shown a truly scary ability to decay in a fairly short period of time. And so I tell folks if you want to store things on optical disks, go ahead, but just be an informed consumer and be aware that you will have a fair amount of your data -- especially if you hold these things for 10 or 15 or 20 years, a fair amount of your data will be lost by the time you get to the 20-year mark. There's no doubt about it. How much data loss percentage-wise? I can't tell you. It depends on how you recorded it, how you stored it, the quality of the media, how much you paid for that media, what the production run was. It's kind of like the videotape problem. You know, it really is a whack-a-mole. There are 15 different parameters that can change it.
And so in the end, if you want to have the most robust way of recording digital data that really, really -- you have a really good chance of getting it to the 20-year to 30-year mark. I tell folks if you can afford it, do magnetic enterprise-class data-grade tape. That's the only way to as close as possible guarantee it. The problem is most of us can't afford that, including myself. So the fact is I do as much as I can with what I've got which is hard drives and making double copies, and that over time is actually the best way to guarantee as much of your content will get into the future as possible.
But, you know, we've got an entire generation now of folks who've been shooting things on digital cameras and shooting things on camcorders. All these things are files, and unless you actually print them out and put them in a photograph album, in the case of photographs -- you really can't do that with video -- you know, what's to say that your grandchildren will be able to see pictures of you at this point in time. There's no guarantee whatsoever.
Q. I think you said press-based optical disks used in hard drives. Why is the failure rate so high?
Snyder. Well, it's the nature of how they're manufactured, especially with dye layers. One of the problems in the film world is even though film, you know, the actual physical substrate, the cellulose acetate that they put the light-sensitive emulsion on, cellulose acetate if you store it properly will last, you know, 500 or a thousand years. However, there's a big difference between the emulsion. If you put black and white emulsion on that film, it will last for 500 to a thousand years. If you put dye-based color emulsion on there, you will see a noticeable difference in the picture in two to five years, and you will see the motion picture turn usually completely red by ten years. And that's simply because you are using dyes. Dyes decay over time. And so anything that uses a dye layer, especially, you know, an organic dye layer, it will decay and it will decay fairly fast.
That's not to say that a large number of disks aren't going to make it to the 10- or the 20-year mark, because they will. They will have beaten the odds. But I'd be willing to bet you have a multiple double-digit failure rate by the 10-year mark.
And there's a little bit better average on pressed disks. In other words, the disks that actually -- you know, we don't use them in the home, of course, but, you know, your DVDs and your CDs, where, basically, they've pressed the bits into the plastic itself and then they sprayed an aluminum substrate on it to reflect the laser, those actually last a little bit better. But even then if you don't store them in pretty much pristine condition, you know, in the dark away from, you know, temperature and humidity and you don't -- you keep them in as good condition as you can, even then you're going to notice some degradation over time, because it's just the nature of this, you know, millimeter-thick aluminum or, you know, gold, whatever the substrate is. The fact is those decay over time, period. They decay by nature of the beast and by nature of being so thin.
And so that's not to say that, you know, multiple double-digit percentages aren't going to survive to the 10-, 20-, maybe even the 30-year mark, but I tell you, the test results that are coming out of the various labs, including our own testing lab here at the library, are not very encouraging. They're actually kind of scary.
Q. You described the term ‘bit rot’ as a process of implosion where you start to lose data in these ones and zeros until you've lost a file or can't access a file.
Snyder. Sure. Or it's corrupted enough that you cannot recover the original structure of the file, and if you can't recover the structure, you can't decode what's in the file. That applies especially to the headers and the footers that are in there, because every file has a header and a footer: the essence can be uncorrupted, but if the headers and the footers are corrupted enough, the file will be unrecoverable.
Which is really scary. And, you know, one of the key things -- and this is kind of what makes people's heads spin, you know, 360 degrees when I say it, but one of the things you have to keep in mind is all digital is analog. All of it is analog. It's just how you're coding the signal that you're putting onto the tape, the disk, or whatever. And so the fact is everything in nature decays over time, and so the zeros and the ones aren't really zeros and ones. What they are is they are highs and lows, whether it is a high voltage and a low voltage or a high physical point and a low physical point or in the case of an optical disk, say, a high reflection of the laser light versus a low reflection of the laser light. It's a high and it's a low.
So basically what it is, is whatever you've recorded on has decayed to the point where the machine that's trying to recover it can't tell the difference between a high and a low. A high is a one. A low is a zero. If you can't tell the difference between them, guess what? Your content's gone, or at least that particular, you know, segment of data is now corrupted.
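[Editor's illustration, not part of the interview: a minimal Python sketch of the high/low idea described above. It assumes a toy model in which a one is stored as a level of 1.0, a zero as 0.0, decay is a simple fractional fade of every level, and the reader uses a fixed threshold of 0.5. The names write_bits, age and read_bits are hypothetical, and real media decay in far messier ways.]

```python
def write_bits(bits, high=1.0, low=0.0):
    """Store each bit as an analog level: a high for 1, a low for 0."""
    return [high if b else low for b in bits]

def age(levels, fade):
    """Toy decay model (assumption): every level loses the fraction `fade` of its strength."""
    return [v * (1.0 - fade) for v in levels]

def read_bits(levels, threshold=0.5):
    """The reader only sees levels; anything above the threshold counts as a 1."""
    return [1 if v > threshold else 0 for v in levels]

data = [1, 0, 1, 1, 0, 0, 1, 0]
stored = write_bits(data)

# Mild decay: highs are still clearly above the threshold, so the data survives.
assert read_bits(age(stored, 0.2)) == data

# Heavy decay: highs have faded below the threshold, so every 1 now reads as 0.
assert read_bits(age(stored, 0.6)) == [0] * len(data)
```

In this toy model the data reads back perfectly until the highs cross the threshold, and then the ones disappear all at once.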
Q. Can you describe the process of what you do when you receive an analog source that has to be digitized? Let's say if it hasn't been in the best conditions, like a tape reel that's been in a very humid room, how do you prepare that to try to get the best transfer possible to digital?
Snyder. Well, first of all, especially if it's come from a dirty environment, before it ever ends up in any of our vaults, any of the storage vaults, when they first arrive we have a series of clean rooms in the building where as you move through the various rooms they take off the old cardboard box packaging that's made out of acid paper, and they replace the reels. They replace many of the portions -- unless there's relevant writing on it or relevant information, the packaging gets replaced, especially for things that have been in musty or moldy areas. Mold is actually one of the great destroyers of all media. We have rooms where we have vacuums and we can treat various types of media to be able to get as much of that mold and mildew off as possible so that once it gets into the vaults it's as clean as possible, because we don't want anything contaminating any of the other things that are in storage. And then when something is identified as being important for preservation, there tend to be two ways that they're identified, although they're not the only ways, but either they are deemed historically significant by the curators here at the library and they are scheduled for preservation or someone makes a request. They see that we've got an old Billie Holiday original 78 master, you know, an aluminum master somewhere in our collection that is the original studio session that she did in, you know, 1948 let's say. And someone requests it at one of the research rooms on Capitol Hill, because we are to a great extent a research facility, and even though we don't host people down in Culpeper, we do service the research rooms up on Capitol Hill.
And so when someone actually makes a request and says, “I want to hear this original pressing,” then at that point it gets onto the list. And of course because we're the Library of Congress, members of Congress can also ask for certain things. They don't tend to ask for historic items, but on the odd occasion they do, and that is also how something gets on the list for preservation.
So once they're identified for preservation, they are brought out of the deep storage vaults. And the vaults are somewhere between 25 and 45 degrees and somewhere between 35 and 60% relative humidity, depending on the media and various other things. They're allowed to acclimate to room temperature or humid room temperature over the course of several days so that you don't have a thermal shock from bringing it out of cold storage into the warm air in the preservation building. And, in fact, we all have our jackets in our office so that if we ever have to go into either of the preservation buildings I have a winter jacket in here so I don't freeze my little butt off whenever I go into the preservation wing. That's one of the standard tools that we have. We have a winter jacket that we wear in the building, because I'm literally 25 feet from cold storage. My office is 25 feet from Phase 1.
And so once it's acclimated, audio recordings especially will go up to the third floor. The third floor is where we do virtually all preservation in the building. They will do one more pass. Especially if it's a grooved item, they will do one more pass at cleaning. If it is tape, then the tape is inspected again to make sure all the stuff that may have been on the tape has gotten cleaned off. If the tape shows signs of something like sticky shed, which is where the layers of the tape start to stick together because some of the binders have started to come off of the tape and join together, then we bake the tape, which is where we heat it up to somewhere between 120 to 140 degrees and for a prescribed period of time, depending on the thickness and the weight of the tape, the amount of tape that's on the reel that basically forces the moisture out of the sticky part, because the sticky is basically moisture plus the original binder starting to haze, and that gives you a couple of hours to play that tape and get that information off. And there are times where the tapes are in such a bad shape that we may get one pass before the tape itself is destroyed. The very act of playing it may destroy the tape, so that may be the one pass that we get. And so we have a bunch of folks upstairs, the audio engineers, who are very careful about how they play tapes, how they play disks, because in some cases, this may be the only time and the only chance they get to digitize the material.
Q. So really human eyes, hands, and ears are involved in every step of the process from sort of diagnosing or determining the importance of the material, what state it's in, and then the processes that need to take place before it's actually ready to put into the archive and digitized.
Snyder. Absolutely. For the analog side of what we're doing, human eyes, ears, and hands rarely can be superseded by anything mechanical.
Q. And that is quite a task. You have over a hundred years of a variety of forms of media. That's an incredible challenge and also an opportunity to make this material -- not just archive it digitally but make it material for generations to come?
Snyder. Absolutely. Well, it's also an opportunity to open up what has been, in essence, a dark archive, because it took a tremendous amount of equipment and personnel to be able to retrieve these old analog formats and play them back in a way, if someone was doing research you had to have all the various videotape formats. You had to be able to play them back for someone who wanted to look at a two-inch quad tape or a one-inch IVC tape or whatever the flavor may have been. And so the ability to play back and allow people access to do research, or just to see the footage for that matter, was tremendously hampered by the fact that you had all these different formats and they were stored in all these different areas and they really weren't terribly extensible or accessible.
And so what file-based formats and what the conversion to digital also allows us to do is to open up our archive for the very first time and get it cataloged in a much more usable way and to also for the first time allow folks to actually see stuff without having to wait two, four, six weeks for something to be pulled back from a mountain in Pennsylvania where it's being stored in cold storage and then played back in a reading room somewhere on Capitol Hill. So access is a big part of what we're intending to do, and that's part of our mission here is to open up this archive in a way that it never could be in the analog era.
Q. You mentioned before the format which you landed upon was JPEG 2000. Could you touch upon this format and why it's so important for the library.
Snyder. Really the enabling technology for us to be able to do lossless preservation -- and that's really the goal of any long-term preservation. You don't want the next format that you copy onto to have more problems than the format you are copying it off of. And one of the great enabling technologies of this new digital era is video and audio compression, but of course one of the problems with video and audio compression is they tend to throw away essence. They throw away either frequency in the audio or detail in the video that enables them to do the bit rate reduction that is the essence of compression.
Now, they do it in what's known as a visually or aurally lossless manner, but there comes a point where you can only throw away so much without it being noticeable, either in the sound or in the video. And so at some point, something's got to give, and in fact if you take a look at some of the very highly-compressed video formats, like, you know, AVC or HDV or the highly-compressed audio formats like MP3, like the stuff they use for podcasts, you'll notice that there are noticeable artifacts in either the audio or the video in the first generation of the recording, and that's because there's so much information thrown away it is noticeable to the human ear or the human eye.
Now, one of the key things about JPEG 2000 is it was the first compression scheme that included a lossless compression ability. Now, one of the key facts of any digital bit stream is you can compress between two-to-one and three-to-one, which basically means if you have a hundred megabits per second, you can reduce the bit rate to between 50 megabits per second and roughly 35 megabits per second. You can actually do bit rate reduction completely losslessly, where you can completely reverse it. You haven't thrown away any frequencies in the video, any frequencies in the audio. There is no footprint to the fact that you have done compression.
And that's based around the fact that because digital signals are ones and zeros and basically they are long runs of ones and zeros -- in other words, if you actually looked at a bit stream as a long list, you know, a long row of zeros and ones, there could be 50 or a hundred zeros and then, you know, a bunch of ones and then another 50 to a hundred zeros and then 50 to a hundred ones. It all depends on the content.
And so one of the facts is you can take, let's say, a hundred zeros in a row and replace them with 16 bits – two eight-bit words, where one word says whether it's a zero or a one and the other says repeat it this many times. And that is the way they do lossless, you know, bit rate reduction. That's called run-length encoding. And so on average you can get between two-to-one and three-to-one compression and it's completely lossless. It leaves no footprint in the audio or video. It does not in any way affect the quality of the original essence that you are trying to represent.
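[Editor's illustration, not part of the interview: a minimal Python sketch of run-length encoding as described above. It assumes the bit stream is held as a string of "0" and "1" characters and that each run is stored as a pair (bit value, repeat count), with the count capped at 255 so it would fit in one eight-bit word. The function names are hypothetical, and this is a teaching sketch rather than the coding used in any particular standard.]

```python
from itertools import groupby

MAX_RUN = 255  # assumption: the repeat count is stored in one eight-bit word

def rle_encode(bits):
    """Collapse runs of identical bits into (bit, run_length) pairs."""
    pairs = []
    for bit, group in groupby(bits):
        run = len(list(group))
        # A run longer than one count word can hold is split into several pairs.
        while run > MAX_RUN:
            pairs.append((bit, MAX_RUN))
            run -= MAX_RUN
        pairs.append((bit, run))
    return pairs

def rle_decode(pairs):
    """Expand (bit, run_length) pairs back into the original bit string."""
    return "".join(bit * run for bit, run in pairs)

stream = "0" * 100 + "1" * 60 + "0" * 100
encoded = rle_encode(stream)
print(encoded)  # [('0', 100), ('1', 60), ('0', 100)]

# Lossless: decoding gives back exactly the bits that went in.
assert rle_decode(encoded) == stream
```

How much this saves depends entirely on how long the runs are, which matches the point that it all depends on the content.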
And so what JPEG 2000 did was it took what originally was developed as a way of compressing data files on computers – in other words, it's the same type of tool that zip files use when you send a zip file across the internet or as an e-mail attachment. It's the same exact technology. That's run-length encoding. And so JPEG 2000 was the first compression scheme meant specifically for audio and moving image representation that included the ability to do that, and that's called the JPEG 2000 lossless profile. And for those of you who actually want to look it up, it's called reversible 5 by 3.
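[Editor's illustration, not part of the interview: "reversible 5 by 3" is the name of the integer wavelet filter used in JPEG 2000's lossless path. The Python sketch below shows one level of the commonly published 5/3 lifting steps on a one-dimensional, even-length list of integers, with mirrored edges. It covers only the transform stage, not the entropy coding that follows it in a real codec, and it is not the Library's implementation. The function names are hypothetical.]

```python
def forward_53(x):
    """One level of the reversible 5/3 lifting transform (even-length integer input)."""
    assert len(x) >= 2 and len(x) % 2 == 0
    even, odd = x[0::2], x[1::2]
    half = len(even)
    # Predict: each detail is an odd sample minus the floor-average of its even neighbours.
    d = []
    for n in range(half):
        right = even[n + 1] if n + 1 < half else even[n]  # mirror at the right edge
        d.append(odd[n] - (even[n] + right) // 2)
    # Update: each smooth value is an even sample adjusted by the neighbouring details.
    s = []
    for n in range(half):
        left = d[n - 1] if n > 0 else d[0]  # mirror at the left edge
        s.append(even[n] + (left + d[n] + 2) // 4)
    return s, d

def inverse_53(s, d):
    """Undo forward_53 exactly by running the same integer steps in reverse order."""
    half = len(s)
    even = []
    for n in range(half):
        left = d[n - 1] if n > 0 else d[0]
        even.append(s[n] - (left + d[n] + 2) // 4)
    x = []
    for n in range(half):
        right = even[n + 1] if n + 1 < half else even[n]
        x.extend([even[n], d[n] + (even[n] + right) // 2])
    return x

samples = [10, 12, 14, 16, 200, 16, 14, 12]
smooth, detail = forward_53(samples)
# Integer arithmetic throughout, so the round trip is bit-exact.
assert inverse_53(smooth, detail) == samples
```

Because every step uses only integer additions and floor divisions that the inverse repeats exactly, nothing is rounded away, which is what makes the transform reversible.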
But the fact that that was in the standard was a first – the original JPEG standards didn't include it, the original MPEGs didn't include it, Windows Media, QuickTime; none of those compression schemes included a lossless compression scheme – and so what that meant for us is we could take our very high bit rate content, especially for the moving image, things that are very high bit rates, and represent them in file sizes that are half to a third of what they would be if we represented and saved the original digital content in an uncompressed fashion. In other words, if we saved all the hundreds of zeros and ones in a row. And that really was an enabling technology, because that meant that instead of having nine terabyte files to represent a digital film we would have four terabyte files.
Now, you might say, oh, my goodness, those are terabyte files. They're a terabyte in size. How do you handle something that big? But the point is we're handling four terabyte files, not nine terabyte files. And when you start doing the numbers, if you start crunching the numbers on how big your repository needs to be -- in other words how many tape drives, how many hard drives, how much processing speed, how many processors, you know, how much media do you need to buy, how much power does the equipment consume -- that two-to-one to three-to-one compression ratio starts adding up as a value very, very quickly. By the time you get to the exabyte level, it'll save you a billion dollars or more all added together, and that's not a small amount of money. Kind of like what one of the Congressmen said once upon a time, a billion here and a billion there and eventually you've got some real money.
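[Editor's illustration, not part of the interview: a short Python sketch of the arithmetic behind these figures. It takes the nine-terabyte film and the two-to-one and three-to-one ratios from the discussion above, and assumes 1 exabyte = 1,000,000 terabytes. The function names are hypothetical, and no cost figures are modeled because the interview gives none beyond the rough total.]

```python
def compressed_size(uncompressed_tb, ratio):
    """Size after lossless compression at the given ratio (e.g. 2.0 for two-to-one)."""
    return uncompressed_tb / ratio

def storage_saved(uncompressed_tb, ratio):
    """Storage that no longer has to be bought, powered and migrated."""
    return uncompressed_tb - compressed_size(uncompressed_tb, ratio)

FILM_TB = 9.0            # uncompressed digital film, from the interview
EXABYTE_TB = 1_000_000   # assumption: decimal units, 1 EB = 1,000,000 TB

for ratio in (2.0, 3.0):
    per_film = compressed_size(FILM_TB, ratio)
    saved = storage_saved(EXABYTE_TB, ratio)
    print(f"{ratio:.0f}:1 -> {per_film:.1f} TB per film, {saved:,.0f} TB saved per exabyte")
```

At two-to-one the nine-terabyte film comes out at 4.5 terabytes, in line with the roughly four terabytes quoted, and half of every exabyte of raw content never needs to be stored.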
And so that's really the enabling part of JPEG 2000: not only is it file based, which allows a whole bunch of other things to happen, but it also just flat out reduces the cost of the job in the first place.
Q. So when an analog source has been prepared and cleaned, it's ready to be transferred into this JPEG 2000 format. Is this a realtime transfer process, or is there a rendering process besides the realtime playing of it?
Snyder. Well, what's actually going on is within milliseconds of each process it starts off by getting converted to uncompressed digital. And in this case, the standard is known as SDI, which is an off-the-shelf SMPTE standard for all kinds of standard definition digital video. And then that SDI is fed into a specifically designed JPEG 2000 encoding chip which then does the actual JPEG 2000 encoding. And then once that's done, the software that they've written also wraps it in the MXF wrapper, inserts the metadata that needs to be inserted, and at that point, you have a file on the hard drive.
Q. It also sounds like an extraordinary challenge dealing with digital files, how those files can be copied.
Snyder. Well, for us it's actually fairly simple. The ultimate way of preventing people from copying things is making sure they can never see it in the first place. And so from our perspective, for what's inside our archive, the only -- at least in these beginning years – place people are going to see content that is here at the Library of Congress is going to be in the research rooms. And those are, you know, closed networks where you would have to do a lot of work to be able to even access those data networks to be able to copy anything off in the first place. And if you did manage to compromise the actual physical network up on the hill, A, we'd know about it within seconds, and, B, you know, we are transmitting things to the computers in the research rooms that are not full quality. If anybody wants to see something full quality, they're going to have to go into a viewing station where they are not going to have any access to the infrastructure that is going to be reproducing that content for them. So preventing people from making any kind of illicit copies is a big part of what we're concentrating on because 80% of our collection is copyright collection stuff, and so it behooves us to make sure that this stuff doesn't leak out.
Q. Let’s address the library's view and response to the issues of digital copying technology.
Snyder. Sure. The way we're getting around it -- or not even getting around it -- the way we're keeping it from being confronted with respect to the material we're digitizing is we're making sure that we have very, very clearly defined and very strictly designed access controls so that we don't give people -- whether it be people in the research rooms or even people who work for the library, who might have access to computers here that operate what we're doing – we make very, very sure that we offer as little opportunity to copy full-quality content as we can. And we make sure that if there ever is an intrusion that we know about it quickly, that we know who did it, and we know what they did, and what content that they got. That is a fundamental design parameter around the systems that we're doing.
Now, one of the problems with the internet and in fact with our personal computer systems as they are designed today is they are very open systems as they were designed to be. They are meant to do practically anything that you want to do in any way that you want to do it, and so if your goal is content control, you want to make sure that whatever content you put out there you can't make a copy of. And the problem is, of course, when you have something like a physical carrier like a CD, you know, you can do the proverbial CD ripping, and once it's in a form on your computer, poof, you can put it on a BitTorrent. You can put it pretty much anywhere you want to. And so our goal here is we don't offer actual physical opportunities for people to access our network so that those types of files can’t have copies made.
One of the things that the broader internet, the broader data industry and personal or consumer content distribution folks are going to have to deal with is digital rights management, but there's also the fact that the internet is so open that we don't always know when there are intrusions or who's doing what, and there are also public policies that come into this. There's a certain level of privacy, especially here in the United States, that's not only expected but in many cases is codified in law. And so there are certain things that either can't be done or can't be done the way one might want to do them if you want to completely restrict access. So there are a number of different competing desires at the same time.
But, you know, they're going to have to deal with the fact that the technology is so open that it's hard to trace, and it's hard not to allow people to copy. And in fact that's why you see there's not a lot of ability to rip things like Blu-rays. You can sort of kind of rip DVDs [sic] if you know what you're doing and if you're willing to pay for the software. You can't really do it with DVDs. You certainly can't do it with, like, SACDs and DVD audio and some of the more advanced formats because they've just never developed the technology, and so that was intentional. You'll notice that you cannot buy a Blu-ray recorder that allows you to basically take component HD in and record it directly to a Blu-ray. That's not a mistake. They have designed the consumer electronics to not enable you to do it. If you want to buy a Blu-ray recorder, the only thing you can do is you can put one of those little SD chip cards in, and the only people who use those are the people who have the little HD cameras, so that means that the content you're copying to Blu-ray that is not a Time Warner film or a Columbia Pictures or something like that, takes a heck of a lot more to be able to get any content onto a Blu-ray.
And that's the key. The key is designing the technology to prevent copying as much as possible. Now, the fact is if you want people to see your content there will always be opportunities for copying. The dividing line is what is the proper amount of access to allow people to see and to pay you to see your content versus not enabling them to make high-quality copies? That is the real key… And I think that will always be an ongoing battle, that will never be completely solved in a free society. We can't do what China does, which is, demand that everybody have filtering software in their computer and report to a central authority what they're doing, when they're doing, and who they're talking to. That's anathema to what we are in America. And so I'm sure the content producers would love to have some form of that, but, that's not the nature of our society.
Q. I see. If you can, I'm very curious about your hard drive systems. I'm looking at a couple hard drives that came from Fry's computers, and there may be three or four manufacturers that all make consumer hard drive systems. What are the drive backup systems that you use to store data on, and is there anything unique or different about them than what a person might recognize as a drive backup in their home?
Snyder. Well, we do use technologies different from in the home, but if someone goes into, you know, any business's enterprise-class IT facility, the data processing center for Citibank or the data processing center for Wal-Mart or, you know, anybody like that, they will see exactly or virtually exactly the same type of technologies. It's what we're doing with them that is different.
The difference between the home and here is that, first of all, our capacities are far beyond anything you could even dream of having in the home. You know, our SAN, our storage area network, is just under 200 terabytes in size. So that, you know, that external hard drive you have on your computer for storing all of your home movies, which is a terabyte or a terabyte and a half, we've got that times roughly 150 just for staging things for our digital repository.
The repository itself is the tape robot which uses high-capacity enterprise-class IT data tape, which is rated for a 30-year life span, and we have a capacity of one terabyte per tape, although that will be increasing as time goes on. In other parts of the plant, basically the path, if you will, is each one of our digitizing stations -- in the case of moving image, it's a JPEG 2000 station. In the case of audio, it's a digital audio workstation that generates a broadcast wave format which is a particular extension to the wave format meant for very high-quality audio capture.
There are, of course, hard drives inside of those computers, those servers that actually create the original files. There are internal hard drives. In the case of audio, it's usually one or two hard drives which are a terabyte in size and are the higher-quality version of what you would get in your computer at home. In other words, they have a much lower failure rate than the ones that you have at home because they're meant to be enterprise-class IT devices. They're meant not to fail nearly as often as consumer devices do. But the concept is the same, you know, a terabyte or two-terabyte drive in each of those computers.
In the case of the JPEG 2000 encoders, we've got four terabytes -- between four and ten terabytes, depending on the machine, of storage inside of each encoder, and once the file is generated, a copy is then made to the SAN which then stages that file for quality control. We've got an automated quality control process that takes the file and puts it through a series of checks to make sure that, first of all, the essence was encoded properly – the essence, of course, being the actual audio and video content – and then all of the metadata was generated properly and the MXF wrapper was formed properly. All these various things that say by the time we actually write this thing to the digital repository and to the actual data tapes that, yes, we have high expectations that this file is absolutely dead perfect.
We also have what's called a SHA-1 Checksum, which is a 128-bit word [sic; SHA-1 produces a 160-bit digest] that is generated. It's what they call a hash, and that basically is a unique footprint based on the size and the content of the digital items within the file itself. And so each file gets a unique SHA-1 Checksum. That's made a part of the overall file, and so every time we access the file, whether it be just to check to see whether or not it's okay on tape or we just copy it back off to use for some reason, either to replay to the hill or look at here or to play in the theater, the Checksum is regenerated and is compared to the Checksum that's embedded in the file. If those two don't match, it means that the file has changed in some way. That's how we know whether or not the data has in fact been corrupted, even in the smallest way. And that's actually a very powerful system for making sure that when we copy from one hard drive to another, when we move things around, or when we check on things years later on the tape, it determines whether or not the data is still in fact in perfect condition.
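The regenerate-and-compare check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Library's actual system: the function names and the sample file name are hypothetical, and the recorded checksum is held in a variable next to the file rather than embedded in an MXF wrapper.

```python
import hashlib
import os
import tempfile


def sha1_of_file(path, chunk_size=1024 * 1024):
    """Return the SHA-1 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(path, recorded_digest):
    """Regenerate the checksum and compare it with the recorded one."""
    return sha1_of_file(path) == recorded_digest


def demo():
    """Record a checksum, then show that a one-byte change is detected."""
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, "sample.mxf")  # hypothetical file name
        with open(path, "wb") as handle:
            handle.write(b"essence and metadata")
        recorded = sha1_of_file(path)          # checksum taken at ingest
        before = is_intact(path, recorded)     # untouched file
        with open(path, "r+b") as handle:
            handle.write(b"E")                 # overwrite the first byte
        after = is_intact(path, recorded)      # altered file
        return before, after


if __name__ == "__main__":
    print(demo())
```

Reading in chunks matters for files of the size discussed here, since a film scan would not fit in memory; the comparison itself is just a string equality test on the two digests.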
Q. Can you describe the storage capacity of the Library of Congress Packard Campus Audio Visual Conservation Center?
Snyder. We have roughly nine petabytes of installed storage. In other words, with one-terabyte tapes, that means we have a robot that has 9,000 tape slots in it, 9,000 tapes. Eventually, we will have a robot here, or a series of robots actually, you know, cabinets with individual mechanisms, where we will have 37,000 individual slots. If the technology did not get better, then at one terabyte per tape that would be 37 petabytes' worth of capacity. Given the technology refresh rate of tape capacity, it's not quite the same as processor speed, but we are doubling roughly every three to five years, and so probably within 18 to 24 months, we will see on the very same tape cartridge we will be able to store two terabytes of data, which means that our capacity right then and there just by switching out the drives that record the tapes, we will double our capacity from 9 petabytes to 18 petabytes or 37 petabytes to what would be 74 petabytes. And then we see in talking to our technology vendors, we see that the magnetic tape has an upgrade path that gets us to somewhere between roughly 32 and 64 terabytes per tape by roughly 2019, the year 2019, which at that point means that in our 37,000-slot robot we could fit well over an exabyte's worth of data.
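The capacity figures quoted above follow from multiplying slot count by per-tape capacity. A short Python sketch of that arithmetic, assuming decimal units (1 petabyte = 1,000 terabytes, 1 exabyte = 1,000,000 terabytes) and using a hypothetical helper name:

```python
TB_PER_PB = 1_000        # assumption: decimal units
TB_PER_EB = 1_000_000    # assumption: decimal units


def library_capacity_tb(slots, tb_per_tape):
    """Total capacity in terabytes for a robot with the given slot count."""
    return slots * tb_per_tape


# Figures taken from the interview: slot counts and per-tape capacities.
installed_pb = library_capacity_tb(9_000, 1) / TB_PER_PB        # 9,000 slots, 1 TB tapes
full_robot_pb = library_capacity_tb(37_000, 1) / TB_PER_PB      # 37,000 slots, 1 TB tapes
doubled_pb = library_capacity_tb(37_000, 2) / TB_PER_PB         # same slots, 2 TB tapes
projected_low_eb = library_capacity_tb(37_000, 32) / TB_PER_EB  # 32 TB tapes
projected_high_eb = library_capacity_tb(37_000, 64) / TB_PER_EB # 64 TB tapes

if __name__ == "__main__":
    print(installed_pb, full_robot_pb, doubled_pb)
    print(projected_low_eb, projected_high_eb)
```

Under these assumptions the 37,000-slot robot crosses one exabyte once tapes hold 32 terabytes each, which matches the "well over an exabyte" estimate.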
Now, one might say, “Oh, my goodness, you know, how are you going to store -- you know,” what could you possibly fill an exabyte's worth of data? Well, we've actually done the calculations on our six and a quarter million items, and especially when we start scanning motion picture film -- because, of course, motion picture film takes a much higher resolution than video and certainly far higher resolution than audio tapes, even very high-quality audio.
And so basically we're in a race with our technology upgrades. By the time our two-terabyte tapes arrive, hopefully 18 to 24 months from now, we will probably be at the 9 petabyte level. So when we do the technology refresh, we may have run out of space by the time the two-terabyte tapes arrive. By the time we get the entire robot in, which is going to be a couple of years from now when we go to 37 petabytes at a terabyte per tape, there's a good chance that by the time we build out the robot, which we're anticipating somewhere in the next four to five years, that we will have exceeded the 37 petabytes that, that entire robot would fill, and we're looking at especially as we take a look at what it's going to take to store all of the digital imaging of motion picture film. And to a great extent, we're inventing that technology right now. We're taking JPEG 2000 and we are scaling it at a point -- you know, to a rate that would truly boggle people's minds. You know, we're going to be in a race once we start doing film to get to an exabyte before we have an exabyte's worth of data. So it's getting to an exabyte is actually a lot simpler than one might think. We could very easily get there without breathing terribly hard.
Q. A lot of that data is film media that you're transferring?
Snyder. It will be. I mean, that will be by far the largest files that we're going to produce are going to be film scans, because they have between 8 and 16 times the resolution of high definition video, depending on the film format that we're using.
And keep in mind another part of our job is we're not just digitizing the collection as it stands today. We also take in copyright submissions, which alone increase the size of our collection by between 100,000 and 250,000 units per year. So we're going up at, you know, six-figure rates just by the yearly submissions of copyright, not including collections that are found in a barn or, you know, whatever.
So a big part of what we're also doing at the same time that we're developing all of this digitization technology is we're actually working with the copyright partners, the major studios and production centers to develop a way of taking their copyright submissions directly as files instead of having them put it on a tape or put it on a piece of film or put it on a hard drive and send it to us. The last thing we need is more things on our shelves since almost everything in media today starts off as some sort of file. You know, a Final Cut Pro file on your computer, a wave file on that little handheld recorder that you just recorded your daughter on, an AVCHD file on your Sony camcorder that goes to the memory stick, or nearly everything that we create today, even the PDF of the paper that you're sending your professor starts off as some sort of file. And so why are we having people spend their money to send things that the law requires us to store until we can't read them anymore when we could develop the system to have them submit that material directly to us as the original full-quality files?
Now, of course there will always be submissions that are physical submissions, like one of the ways the major production studios register copyright on consumer releases is by sending us copies of Blu-rays and DVDs and CDs and video games and all sorts of -- I shouldn't call them video games. Computer games or whatever you want to call them. Electronic gaming. They send us copies of that all the time because that physical asset itself is eligible for copyright, the very fact that it is the package that it is and the content is recorded on that package in a certain way.
So we will never eliminate physical submissions, but the vast majority or at least 50% of what we get submitted every year we think we can move to file-based submission, which basically means that not only are we going to be increasing the content because of what we are digitizing in our own collection as it stands today, we are increasing the number of files that are being submitted by the people who are submitting things for copyright in six-figure amounts every single year.
Q. So wouldn't it be true to say that you're expanding the scale? And it seems there's also a bit of an urgency here to archive these analog sources now, while they are still around.
Snyder. Yes. In fact in some cases, you know, luckily not too many, but we do have some cases where some of the media has degraded to the point where it's hard to get a good copy off. We are in a race against time for some of the oldest video and audio tapes because they are decaying and we may not get to some of it before they're gone. And that would be a real shame. But also -- we're also in a race against time developing all these technologies. Keep in mind everything I've described to you in the last hour and a half has come to pass only, or the technology at least, has only really come to pass in the last five years.
So we really are at the bleeding edge of technology. And so, you know, some of it is funding. Some of it is staffing certainly. And if folks want to help us out on that point, please write your congressman and tell us [sic] to throw us more money.
But some of it is also -- and a fair amount of it is also the fact that we are at the bleeding edge of the technologies that we're using. We are pushing pretty much every envelope there is, and there's only so fast you can go. Discoveries only happen so fast. You only learn things so fast. And so a lot of it is, given the timeline that we have, we're doing the best that we can, but, you know, we can only do it so fast.
Q. You just mentioned some of these early and rare prized pieces of film and audio. If you're familiar with them, can you name a couple off the top of your head so people know what kind of material that is in danger on that list?
Snyder. I think in terms of what kind of tape was recorded on and what time period was that tape recorded and how was the material stored. That's how I think when I think of how in danger recordings are. And so if I find something on a particular type of tape -- and I'm not inclined to embarrass any manufacturers even though they may no longer be with us -- but there are certain types of tape where at the time it made perfect sense to try a new formulation, but unfortunately we've learned in retrospect with the benefit of hindsight 30 or 40 years later, we've discovered that these new tape formulations back in the '50s and '60s are not aging terribly well.
And so you really can't place blame. You just have to understand that the folks at the time who don't know what we know now made some relevant decisions for the period and we have to deal with those decisions in an expedited fashion. And so pretty much anything that's recorded on those particular tape types that are the newer formulations, videotapes especially where there was a change in formulation roughly in 1962 when the original tape formulas, which were 3M, Scotch, and Ampex were the very first original videotape format. Those old light brown videotapes have actually turned out -- the very first videotape ever created turned out to be exceedingly hardy. And so the first six or seven years' worth of video recording is actually surviving better than the stuff that was recorded in the mid-1960s when the tape was reformulated.
And so we have to identify in our collection what's on these endangered tape formats, and some of these things change, sometimes where you might have a tape formula that was changed and they actually experimented from production run to production run. So even though if you have a given reformulated style of tape and it had the same brand name on it and the same tape brand on it, like Type X as opposed to Type Y, sometimes production runs within Type X can be different where one production run of Type X is perfectly fine and another production run of Type X where they changed something very slightly but in 40 years it turned out to make a big difference. That's actually kind of the whack-a-mole that we have to deal with. We're figuring out what are the most endangered pieces. And so pretty much anything recorded on that tape from that era is potentially in danger and we have to take a good hard look at putting it at the front of the line.
At the same time, we've got some other things like how the Congress requires us every year to name 10 to 20 sound recordings, motion picture recordings, video recordings that are of historical note, and those are automatically put at the top of the list no matter what the original media's quality or lack of quality may be. And so there are other decision-making processes that are a part of it.
Or if NBC decides that they want to go back and actually pay the money to have us recover some television series like Bonanza or Laugh-In or whatever because they want to release it on home video or they want to make a high definition version of it or whatever, that tends to move it up. It won't move it to the top of the list, but it will move it up in the preservation queue because NBC's willing to reimburse us the cost of doing the work now. And so there are a couple of different decision processes that go on, and all of them are relevant.
Q. What do you see the future goals of the Audio-Visual Center as being as your collection grows?
Snyder. Well, our goals at this point are a couple different things. For public access, it's to allow anybody who comes into the Library of Congress's building, any of the public access areas, to be able to search for and see or listen to pretty much anything that's in the collection, you know, within the Library of Congress's zone of influence, in other words in our buildings on Capitol Hill, the research rooms on Capitol Hill.
In the case of the Congress, some of that content is also permitted to be transmitted over all of Capitol Hill, because we do serve the Congress and they do use our facilities for research purposes, so some of that content may go beyond our physical buildings for congressional purposes but not for public purposes.
For the parts of our collection where the content has already gone out of copyright, in other words it's a hundred years old or whatever the reason may be, the copyright has expired, that content, our goal is to put it on the public internet. In other words, just the same way you'd sit down in the research room, search for something, and then watch it, anything that is cleared for copyright potentially could be viewed over the public internet. Now, that's still a few years away because we've got a whole bunch of different infrastructure projects to get through before that capacity is there, but that's the end goal, and hopefully we'll get there.
For the copyrighted material, there may be on the odd occasion points where the copyright-holders say, “Well, we're fine with you guys doing the streaming-quality version of our material for research purposes over the public internet, so yeah, you can go ahead and do a YouTube-quality video for certain types of our material, and maybe we just want you to put a chyron on it that says ‘copyright 2009,’ you know, whoever, ‘for research purposes only.’” You know, a bug gets put on it so, you know, if somebody manages to copy it, which we're going to work very closely to make sure they don't, but if they do manage to copy it, at least it's got a watermark on it that's very, very clear.
And if we have permission, we'll go ahead and stream that content as well, even if it's copyrighted, and then for the parts of our collection that are out of copyright, we'll just go ahead and make that searchable on the internet at some point as well, and viewable and listenable.
Q. Thank you, Mr. Snyder, for sharing your extensive knowledge on the Library of Congress Packard Campus Audio Visual Conservation Center's work. This has been a wonderful interview.
Snyder. Certainly. For me the two terms that have a very different meaning for me versus everybody else is the term "longevity,” because the authorizing legislation for this said that we have to keep the archive extant for the life of the republic plus 4,000 years. That tends to give you a very different perspective on longevity. And then one of my favorite terms -- you know, something from Saturday Night Live, the old Coneheads saying, "consume mass quantities," I have a very different perspective on consume mass quantity. So it's an interesting job. I get paid to do cool things.