>> From the Library of Congress in Washington DC. ^E00:00:04 ^B00:00:22 >> Mark Winek: Welcome to the annual meeting of the Potomac Technical Processing Librarians. My name is Mark Winek. And I've had the pleasure of serving as the PTPL Chair for 2015. And, today marks our 91st meeting as a group devoted to the technical side of supporting the informational needs of school children, university students, the legal profession, religious communities, the military, the marginalized, and everyone in between. And, before we get the day's proceedings underway, I'll also get the thanks out of the way. And, I would like to express my deepest debt of gratitude to the 2015 PTPL Advisory Board. If you're on the board, could you please stand up just so people can recognize you. Thank you. Not only have they made each meeting a pleasure, they've each provided their unique talents to make PTPL a stronger organization in this year and for the future. And, I'd especially like to thank the Executive Committee, Vice Chair Tiffany Wilson, Past Chair Linda Geisler, Secretary Sue Neilson, and Treasurer Linda Wirth. We especially have to thank Linda Geisler. Linda returned to the PTPL Advisory Board when we were in need of an executive two years ago. Not only has she provided her services faithfully, she used her good offices to ensure that we had a monthly meeting place here in the Madison building and took the lead on coordinating the space and catering required for today's meeting. And, I also want to thank Bobby Reeves, our webmaster, who keeps the website and the conference registration running year after year without any complaint. And, Keri, I hope that's reflected on his annual review every year. So, for your convenience, Wi-Fi is provided by the Library of Congress. The network name is locguest, and you won't need a user name and password, but you will have to consent to the terms and conditions. Restrooms are located out the main entrance of the Montpelier room, across from the elevators, down the hall on the right. There's a scheduled 15 minute break in the proceedings at 11am, and lunch will be served from 12:15 to 1:15 in this room, with vegetarian options available. And, if you don't keep to a vegetarian diet, we ask that you ensure that there are plenty of such options available for those who do. And finally, just for the next few minutes, there are signup sheets available at the registration table for the optional Library of Congress tour immediately after the meeting. And so, the tour is complimentary, but the space is limited and we have to provide the numbers to the docents early, so please make sure to preregister if you are interested in that. So, without further ado, it's my pleasure to welcome our keynote speaker Dorothea Salo. Dorothea is a Faculty Associate in the School of Library and Information Studies at the University of Wisconsin Madison. At Madison, she teaches organization of information, database design, and, surprisingly, XML and linked data. She has presented internationally on linked data and data curation, among other topics. Prior to her move to Wisconsin, she spent two years as a digital repository services librarian at George Mason University in Fairfax. And, please join me in welcoming Dorothea Salo. ^E00:03:59 ^B00:04:19 >> Dorothea Salo: Turn this this way. We're doing the usual switching of laptops today. While we're waiting, I, actually, personally would like to thank Linda for hosting me. It was very, very gracious and I so appreciate it.
and, thank you, all of you, for the wonderful welcome I've had here today. So, I am Dorothea Salo. And, I am not, nor have I ever been, a cataloger. Though, I may be, I'm not sure, I may be one of the last generation of library school students who was actually required to take a cataloging course. I don't know. I just know we, at the iSchool at UW Madison, are doing one of our periodic major curricular revisions. And, current odds are that we're keeping the core organization of information course, because that's core. We're just changing it to deemphasize MARC cataloging a bit in order to include more material that's relevant to non-MARC environments, because that's where an increasing number of our students are going. So, interesting times, not that they're ever not. So, curriculum revisions, they are never fun. They eat everybody's time for months on end. They always cause bureaucratic hassles out the wazoo. For a while, you have to deal with two different incompatible curricula. And remember which rules apply to which student you're trying to advise; next week's advising week, so it's on my mind. And, it's, it's a mess. So, why do we do this to ourselves? Right? Why do we bother if it's such an awful hassle? And, this is my answer. ^M00:05:48 [ Speaking foreign language ] ^M00:05:50 Because it must be done. Just, in Latin it kind of sounds cooler. We at the iSchool cannot just sit back and do what we've always done just because we've always done it that way. Not when the world that our graduates need to fit into is way different from what it was when we built the old curriculum. And, it doesn't mean we did a bad job on the old curriculum. I don't think we did. It's just that the world has changed out from under it. So, we have to change the curriculum. We don't have to enjoy it. We just have to do it because it must be done. Not entirely coincidentally, that's kind of how I feel about the move away from MARC. It feels, to me, like a lot of library professionals spent at least half a decade on the question. ^M00:06:46 [ Speaking in foreign language ] ^M00:06:48 Or, why must it be done? Even if this question sounds really cool in Latin, I have completely lost patience with it. Don't even tell me nobody's asking this still; I straight up heard it just last May at a conference. Totally still out there floating in the water. And, if you like water, here's some nice water, nice harbor, really pretty, love the little lighthouse. But, the thing that isn't in this pretty picture of a pretty harbor is a ship, because the why-can't-we-keep-using-MARC ship has sailed. It has sailed. I'm not even having the why-do-we-have-to-change discussion today. I don't see the point. That ship has sailed. I mean, I'm the program planner, as it happens, for the IT division of the Special Libraries Association. And, I was talking to SLA's technical services planner, Betty Landesman, maybe some of you know her. She's local to this area. Was talking to her about a linked data session. She rolled her eyes at me and she said can we please not, right, do yet another intro to linked data and why it's better than MARC? Can we not? I've seen a ton of those and they don't help. Okay then. If I've got catalogers yelling at me not to do this, I'm going to try not to do it. Because, really, the answer is exactly the same as it is for our curriculum revisions at the iSchool. ^M00:08:09 [ Foreign language spoken ] ^M00:08:10 Because it must be done.
I do want to mention, though, because I come at technical services from an XML and metadata background rather than a MARC background, that it isn't just MARC cataloging that the bell is tolling for here. There's zero chance that XML-based metadata practice is going to stay the way it is today, not going to happen. I already see it changing. I'm not even sure XML is going to stay alive as a pure metadata format, as opposed to uses like TEI for the digital humanities, EAD for archives, where you're dealing with narrative-type documents that are intended mostly for human beings; that's where XML really shines. I think those will survive. But, XML for metadata, not sure. And, you know, I'm actually okay with that. I'm okay with XML's decline as a metadata serialization; I never liked my nice clean elegant document standard getting worked over by the data engineers anyway. Don't even talk to me about XML Schema. Maybe now I can have XML back for documents, the way it's supposed to be. Now, I'm actually much more interested in this question. ^M00:09:23 [ Speaking in foreign language ] ^M00:09:27 What actually is it that must be done? What do we have to do to our catalog data and our other metadata so that it works in this world where so much has changed about how people find information? I like this question because it's pragmatic. I like this question because it's intriguingly complicated. I like it because it's nervy in all the best ways. I like it because I'm an inveterate fiddler with things, and there's just great huge masses of MARC and XML right there waiting to be fiddled with. And, it's just another of these questions that we've got to work on or we just stay stuck. Right? And, I don't think it's enough to just say, well, what we have to do is we have to migrate our data from MARC and MODS and METS and the various cores, Dublin Core, Darwin Core, VRA Core, PBCore, all the cores. We have to migrate all that to linked data and then we're done. That's skipping all the steps. Right? That's an underpants gnomes approach, you know: step one, migrate all the things to linked data; step two, question mark, question mark, question mark; step three, profit. Not. That's not enough. That's like saying we have to pick up some rocks and then we'll turn them into a giant mosaic. Not enough information there. What's the mosaic design going to be? Where's the mosaic going to be built? Where do we find the right colored rocks, and how do we cut them up if they're too big or they're not the right shape? How do we glue all the rocks down? What happens if somebody glues down a rock in the wrong place? What if there's an earthquake? So yeah, we've got to work on some process here, I think, for mosaics and certainly for our data and metadata as well. And the other reason I don't think it's quite enough to say what we have to do is migrate everything to linked data is that it assumes without proof that linked data is the ultimate destination for all of our stuff. Which, yeah, right now it's the horse to bet on, I'm certainly not denying that. But, I think linked data gets used as a stalking horse, sometimes a little bit of a scapegoat. It's linked data's fault we have to change away from MARC. It's linked data's fault all of these changes are happening to begin with. If it weren't for linked data supposedly being the new hotness, we could just stay the way we are and everything would be fine. And, I don't actually think that's true.
If linked data didn't exist, and believe you me, there are times I would love to wipe RDF off the face of the earth, if there were no linked data, we'd still have to make changes in how we collect and organize our catalog data and our metadata. And, we have to make those changes for the same reason that we're changing our curriculum at the iSchool. The world has just plain changed out from under the old ways. And, that didn't happen when the Library of Congress or the British Library or the National Library of Spain or the wonderful folks at Oslo, that didn't happen when they announced their linked data plans. The changes happened a lot earlier than that. They happened, these changes, when paper cards gave way to the web as the main way that patrons interact with library catalogs and with digital collections. And, yeah, yeah, I know there are people who are medievalists in here. So, I cheated; the left-hand picture there is not actually paper, it's papyrus. But no, that's when it happened. And, I mean, it's not that we didn't notice. Of course we noticed. How could you not? It's just taken a while for us to figure out what we need to do about it. Oops, which I don't know how to say that in Latin, but, oops. I think we maybe waited longer than we should have. But that's, you know, that's water under the bridge now. Which leads me right back here. ^M00:13:44 [ Foreign language spoken ] ^M00:13:46 What do we have to do now that the work that we do has to play nicely with computers? And, not just computers. I mean, MARC was designed for computers, but these are networked computers, computers that can talk to one another, because the network really does kind of change the game. We're going to start where I start with my students. When I teach our core organization of information course, I start here. ^M00:14:14 [ Foreign language spoken ] ^M00:14:17 Which basically means computers aren't real bright. I mean that. Computers are not all that bright, I tell my students. You are way smarter than a computer. And, I say this for a lot of reasons. One reason is knocking computers off pedestals, not actually literally knocking computers off pedestals, though, hey, that would be kind of cool. But, you know what I mean. Right? A lot of my students come into the iSchool thinking that computers are small gods. They're magical, they're capricious, they're inscrutable, they're liable to mess you up like Diana's about to mess up that antelope there. And, I'm saying, I have to get them to not think that, because the more they understand about how computers do work, the better off they are and the better off we all are. But, actually, the main reason that the notion that computers aren't real bright is relevant to this talk today is that, from our point of view as literate human beings, computers are not too bright in some very specific and fairly easy to understand ways. And, those ways tell us, actually, pretty clearly what our catalog metadata and other data have to look like if we want computers to work effectively with it. And, really, this is no different from how the shape and size of catalog cards and the standard size of typewritten lettering shaped how the MARC record had to look. The technology that you have available, and the card catalog is totally technology, don't let anybody tell you it isn't, the technologies available to you shape how it makes the most sense to do things, because different technologies are good and bad at different things and need different things to function best. That's basic design theory.
If you haven't read Donald Norman's Design of Everyday Things and a couple of its sequels already, please read them. Brilliant stuff that will, it will explode your brain in all the best ways. So, the first thing to remember about computers, and I see some people already smiling, they know where I'm going with this. The first thing to remember about computers is that text, right, the ordinary stuff that we write every day for other people to read, the text that we literate human beings read and comprehend so fast, so easily, so accurately that we hardly have to think about it, is all Greek to a computer. Right. You knew that cliche was showing up at some point. How could I not? Computers can't read. They are functionally illiterate. If anybody in this room has a kindergarten-age child at home, your kindergartener reads and comprehends text immensely better than a computer can. So, in my head, one step toward coping with illiterate computers is dealing with our addiction to textual notes. By way of example, I took these MARC 504s straight from the Library of Congress' MARC documentation. And, thanks for that, by the way. If anybody here is in any way responsible for it, it is super helpful in my classroom. But, for example, if a patron question that we would like our catalogs to be able to answer is, hey, I'm new to this topic, can I get a recent book with a good bibliography please? Right. Perfectly reasonable question. These notes being free text means that our catalogs can't actually come up with an answer. Because, to get to an answer means filtering a list of books by whether they have a bibliography or not. And, to do that with MARC notes, the computer has to read the text. And, it's got to understand that bibliography and bibliographies and bibliographical and literature cited and sources and maybe whatever that Romanized Russian is, I'm not even sure, I didn't look it up, it's got to figure out that this all means yes, there is some kind of bibliography in this book. And no, a computer can't just look for the existence of a 504. That's not enough, because 504s don't just note down bibliographies, they also note down indexes and maybe some other information. So, straight up, the computer is not bright enough to figure this out. It can't read, much less read all the languages that we end up transcribing stuff in, especially here at the Library of Congress. Right? Much less comprehend what it's reading. And, that makes a lot of the stuff in our MARC records a lot less useful to patrons than it could be. Computers get yes or no. Yes or no, on or off, this they get. They're real good at that. Checkboxes are candy to computers. Computers love checkboxes. So, for any conceivable criterion that we want our patrons to be able to search on or filter their results on, we pretty much have got to quit recording it in text and make it a checkbox or, you know, radio buttons if there are more than two options, that also works. If you get the sense from this that I really like MARC fixed fields, you're totally right. Though, honestly, that's a weird thing to like. That's a little weird. If a lot more of MARC had been expressed in fixed fields instead of free text, we'd be in a much, much better situation right now. Now, let's say for a moment, let's pretend that the titanic arguments that we in the profession are going to have to have, about when a book is said to have a bibliography and what actually counts as an index, let's say we've had those and we've drawn the best line we know how.
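Just to make the checkbox idea concrete, here is a minimal sketch, in Python, of the kind of batch pass that could derive a yes-or-no bibliography flag from free-text 504 notes. The keyword list is invented for illustration, and transliterated or non-English notes would need their own handling; deciding what actually counts as a bibliography is exactly the line-drawing argument described above.

```python
# Purely illustrative sketch: derive a yes/no "has bibliography" flag from
# free-text MARC 504 notes. The keyword list below is made up for the example.
import re

BIBLIOGRAPHY_HINTS = [
    r"bibliograph",        # bibliography, bibliographies, bibliographical
    r"literature cited",
    r"works cited",
    r"sources",
]
HINT_RE = re.compile("|".join(BIBLIOGRAPHY_HINTS), re.IGNORECASE)

def has_bibliography(notes_504: list[str]) -> bool:
    """True if any 504 note looks like it describes a bibliography."""
    return any(HINT_RE.search(note) for note in notes_504)

print(has_bibliography(["Includes bibliographical references and index."]))  # True
print(has_bibliography(["Includes index."]))                                  # False
print(has_bibliography(["Literature cited: p. 67-68."]))                      # True
```

The point is not that four patterns solve the problem; it is that once the decision lives in a flag the computer can trust, the patron's filtering question becomes trivial to answer.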
We're still going to have to deal with this giant horrible mass of free text that's hanging out in our 504s in our catalogs right now, because computers aren't bright enough to understand it. ^M00:20:40 [ Foreign language spoken ] ^M00:20:41 It's got to be done. And, I don't want to get down in the weeds on this. Actually, I'd love to if we had the time, because this is exactly the kind of problem I salivate over. I'm a giant nerd, though, and not everybody is. So, I just want to say this class of problem can be solved for the great mass of our records without hand editing. And, of course, it's got to be, because there ain't enough staff time in the universe to check all those checkboxes. And, it's a thing that's got to be done for every characteristic in our free text notes that we want users, patrons, to be able to filter or search on. Another serious and ugly free text problem we have in our records has to do with places where our content standards don't force us to be consistent about how we record information, you know, whatever, it's cool, just type something, it's fine. And, on catalog cards, this inconsistency didn't matter. Right. Because the information was only ever going to be looked at by a human being, who doesn't need the consistency to figure out what's going on on the individual card. We're literate humans. We're smart. We can figure this stuff out. Computers are astoundingly literal-minded, right. You can take some text, add a space to it, and all of a sudden, to the computer, it's a different thing. Just a space, right? A human being wouldn't even be able to see it. But, to the computer, new thing. Once again, not just a MARC thing. Consistency in Dublin Core metadata, don't make me laugh. I actually am more likely to cry; it's really that bad. And, I once got an entire published article, Cataloging & Classification Quarterly, you can look it up, out of one poor soul in the institutional repository I was running at the time; the poor guy had stuff under eight slightly different spellings of his name. Eight. Just terrible. And yeah, I fixed it as soon as I had the screenshot I made for the article, I wasn't just going to leave it like that. But, it goes to show, doesn't it? And I want to call out two Dublin Core things specifically, noting that you're going to find these problems in a lot more places than Dublin Core. Dates, oh my gosh. Oh my gosh, dates. Dates are really important to information seekers. So, it's really important that we record them consistently, such that a computer can reasonably intelligently filter based on them. And, oh boy, we're not there yet. We're not even close to there. Dublin Core, MARC, totally doesn't matter. The people who try to make not too bright computers work are tearing out their hair about the way we do dates. And, we've got to fix this. ^M00:23:34 [ Foreign Language spoken ] ^M00:23:36 It's got to be done. As for rights statements, this is another one that comes from the digital collection side more than the catalog side. It's important because the people who come to our digital collections often want to use them in particular ways, and they want to know if they're allowed to. We have to be clear about what our patrons are allowed to do. And, to do that, our search engines have to be able to tell which users can do what with which items. And, that's all free text now, and it's a complete disaster. Europeana and the Digital Public Library of America are working on this, thankfully, bringing a little bit of reason, I hope, to this chaos.
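Since dates are the clearest case, here is a small illustrative sketch of the kind of normalization pass a not-too-bright computer needs before it can filter on dates: a few of the messier forms that show up in records, pushed toward a consistent, sortable shape. The input strings and rules are invented examples, not any particular catalog's data or any official standard.

```python
# Illustrative only: normalize a few messy date forms to a consistent,
# machine-sortable shape (YYYY or YYYY-MM-DD). Real records have many more
# variants (ranges, "n.d.", uncertainty) than this sketch handles.
import re

def normalize_date(raw: str) -> str | None:
    """Return a sortable date string, or None if a human needs to look."""
    s = raw.strip().rstrip(".").replace("[", "").replace("]", "")
    if re.fullmatch(r"c?\d{4}", s):                        # "1987", "c1987"
        return s.lstrip("c")
    m = re.fullmatch(r"(\d{1,2})/(\d{1,2})/(\d{4})", s)    # "3/7/1987"
    if m:
        month, day, year = m.groups()                      # assuming US month/day order
        return f"{year}-{int(month):02d}-{int(day):02d}"
    if re.fullmatch(r"\d{4}-\d{2}", s):                    # "1987-03"
        return s
    return None

for raw in ["c1987.", "[1987]", "3/7/1987", "198-?"]:
    print(raw, "->", normalize_date(raw))
```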
I do not envy Europeana and DPLA that job one bit. Copyright's bad enough; free text only makes it worse. So, an example of the difficulties with free text that I use in class a lot is from library software developer Bill Dueber, who took a close look at what was after the ISBN in the 020 field in the catalog that he was working with. And wow, it's horrific. Just in the top 20 responses over there on the left, you can completely see the inconsistency. And then on the right, he drilled down and took a look at anything that was kind of sort of hardback, and he got all those. Yeah. The more you drill down, the worse it gets. So yeah. Our catalogs can't answer the very simple question: yo, is this book print or electronic or both? At least not based on the 020. And yeah, I know RDA fixes this, and yay, that's a good thing to have fixed. Bottom line, though, a lot of our catalog data is just hopelessly internally inconsistent. And, sometimes that's material for patrons and sometimes it isn't. I'm not saying that this is the top thing that we've got to fix in our catalogs. I don't actually know what that is. That, that knowledge, that expertise is with you. But, when it is material, right, this is my call to control all the things, all of them. Anything that's useful, material to patrons, in a record that is not actually transcribed off the item needs a controlled vocabulary or, you know, other appropriate standardized expression. You can't do a controlled vocabulary for dates; you just have to be really clear about how they're written down. I can't with this. What? What even is this? And for a computer it's even worse. Just type something is not okay in 2015. Transcribe it or control it. There is no third option. ^M00:26:23 [ Foreign language spoken ] ^M00:26:25 And, since I've used the word transcribe, let me just say: this thing where we intentionally transcribe typos and other errors into records, in information that's material to a patron searching or browsing, can we, can we not, please? Can we not do that? Because computers won't know. Computers will happily operate on the wrong information, not knowing that it's wrong. So, we've got to stop propagating mistakes. Consider it a service to publishers, poor guys, as well as to our poor patrons. Now, back to the 020 field for another reason, one that has nothing really to do with cataloging practices and more to do with ISBNs. Now, I see my org of info students out there. I can't fool you. I want you to know I know this. I can't fool you with this question: is the ISBN a good book identifier? And, we all know it's not. We know that tons of books don't have ISBNs. We know that ISBNs accidentally and intentionally get reused. We know that it's not totally clear what a book even is in ISBN land. It's kind of an edition but not really. It's kind of a format question but not really. It's all very, very confusing. Perhaps predictably, it confuses computers too. Because computers are so literal-minded, they need us to be really super unambiguous when talking about what kind of thing something is. They have to have a really super clear definition. This is a book. This is a print book. This is an eBook, you know, whatever. If you and the computer have a different definition, a different idea in your head about what a book is, well, the computer is going to do wild, random, unexpected, and unpredictable things from your point of view. And, you know, the computer is happy to use whatever definition we come up with. It doesn't care.
But, in spite of FRBR, and sometimes, it must be said, because of it, we don't really have clear definitions here that we can operationalize in our systems, that don't lead us into logical contradictions or really bad edge cases. And, we kind of need that. So, that's one thing. We've got to figure out what exactly we're talking about when we say things like book and eBook and hardback and so on, so we can explain the distinctions clearly to the computer. And, if this reminds you of Suzanne Briet trying to explain when an antelope is a document and when it isn't, I'm right there with you. It's totally going to be weird and sometimes theoretical like that. And I have to say, this mosaic is kind of like my personal head canon for how Suzanne Briet looks to me now, you know, whatever works. And then, once we've figured out the kinds of things that we're talking about in our, in our universe, we have to be able to point. And, excuse me for being rude, but, be able to point clearly and unambiguously at every single example of those things that we, collectively, as a society, I will say, that we have, because that makes it easier to pull the information about these things that we, collectively, have. The network can work for us if we let it. But, to let the network work, the network needs a common way to point at things and understand what's being pointed to. And, for a computer, that means an identifier that, unlike an authority string, can never ever change. Computers get very confused by change, even more confused than the rest of us. And, for a networked computer, that means an identifier that's unique not just in your organization, your library. So, your call number, your barcode number, that's not going to work. It's got to be unique worldwide. So, that means another thing. It can't be language-dependent. Because, you know, we've tried already collating our records by fuzzy matching and trying to deduplicate; that's how metasearch worked. But, we pretty much all know that metasearch never worked all that well. Computers are not bright enough to fuzzy match well all the time. And, catalog data and typical metadata, they're sparse enough that they're not good candidates; there's not enough there for a computer to fuzzy match anyway. We're still going to have to use this just to get started, because it's what we've got. We don't have anything better. But, for some stuff, for some catalogs, the identifier thing is going to be a long haul. So, that means we've got to have unique identifiers for our stuff that are way more reliable than ISBNs. And, if you know the linked data world at all, you already know that the scheme that's been settled on for these is URIs, which 99 times out of 100 look like URLs. And, the reasoning there is we already know how to make URLs globally unique, because we already do that or the web wouldn't work. That's all it is. It's just keeping the computers from getting confused. And so, another problem we have to solve, if we're going to take advantage of the network, is taking identifiers and quasi-identifiers that we've been using for things, like ISBNs, like authority strings, and matching them up with the URIs, URLs, that have been established for those things. Not going down into the weeds here, not even the seaweeds with the dolphins. But, I do want you to know, if you don't already, that the nerd word for this process is reconciliation. Nice word. And, it can be partially automated if you know your source data well, as catalogers tend to.
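Here is a deliberately tiny sketch of what partially automating reconciliation can look like: normalize the heading strings you already have, then match them against a table of known headings and their URIs. The table, headings, and example.org URIs below are all invented for illustration; in practice the lookup would go against a real vocabulary service such as id.loc.gov or VIAF, and anything that does not match cleanly gets routed to a human.

```python
# Illustrative reconciliation sketch: map local authority strings to URIs.
# KNOWN_HEADINGS stands in for a real vocabulary service; the headings and
# example.org URIs are made up for this example.
import re

KNOWN_HEADINGS = {
    "twain, mark, 1835-1910": "http://example.org/authorities/n0000001",
    "cather, willa, 1873-1947": "http://example.org/authorities/n0000002",
}

def normalize(heading: str) -> str:
    """Fold case, squeeze whitespace, drop trailing punctuation."""
    h = re.sub(r"\s+", " ", heading.strip().lower())
    return h.rstrip(".,;")

def reconcile(heading: str) -> str | None:
    """Return a URI for the heading, or None if it needs human review."""
    return KNOWN_HEADINGS.get(normalize(heading))

for h in ["Twain, Mark, 1835-1910.", "twain,  mark, 1835-1910", "Clemens, Samuel"]:
    print(repr(h), "->", reconcile(h))
```

Once a heading resolves to a URI, the next move described below, asking trustworthy sources what they know about the thing, is usually just an HTTP request to that URI asking for a machine-readable answer, and storing the little pieces that come back.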
And, the win here is that once you have a URI for something, you can go out on the network and ask a whole bunch of trustworthy sources, hey, what do you know about this thing, and get back answers. To me, that's how what we now think of as copy cataloging is going to work. You ask tiny little questions of very reliable sources, get tiny little answers back, and you build them up into your search engines and your facets and your browsing tools, and all the other user interface chrome that we're already familiar with. And, it won't have to be done by hand. If you tell a computer, every time I feed you a URI for a book we just bought, ask this question of that source in this way and store the answer there, it will do that reliably and consistently every single time. And, I certainly believe this will be a better solution to what I think of, capitalized, as the Problem of Vendor Records. You're familiar with this problem. I don't have to elaborate. What I'm saying is, the problem of vendor records is nine times out of ten a problem of vendors struggling not only with AACR2 and MARC, which are hard enough, but with MARC practices that are inconsistent across libraries to begin with. It'll be a lot easier for us and for vendors if, instead, our computers are asking their computers a lot of little tiny questions with little tiny answers. I'll say at least one last thing about the 020, okay. ISBNs, not unique. The field includes inconsistent format information, yeah, yeah, we got that. Here's my question. What the ever-living heck is format information doing in an ISBN field to begin with, much less information about volumes and multivolume series? And, if it does have to be there, and, I know, I actually do know why it's there, can't it at least be in a separate subfield? What is this business with the parentheses? I'm showing a mosaic detail here for a reason. You can see all the little tiny individual rock bits, and you can see how very carefully they're placed and that none of them overlap anywhere. But that's what our catalog data and metadata need to look like, no overlaps, nothing just jammed together willy-nilly, everything in little tiny bits and each tiny bit in its own singular place, very carefully set apart from the other tiny bits. It's called granularity, sometimes atomicity if you're a relational database nerd, and computers love it. See, this is the thing about computers. Computers are really, really good at building up whole mosaics from little tiny pieces. They're good at that. What they're critically bad at is taking a whole and breaking it down into parts. We have to do that for them. And, as we saw with the 020 field, we often don't. Or, if we do, we do it in ways that the computer finds confusing or inconsistent. So, I grabbed a couple of examples from IFLA's ISBD supplement. And, again, if anybody here is from IFLA, thank you for this, classroom lifesaver. Just to show the difficulties, think like a computer for a second. There's a whole lot of punctuation all up in here. And, it's not at all obvious, even to me, and I'm a human being, what all of it means, what's being set off by it, or even if it means anything at all. I mean, just the physical description, the globe there, tell me what a period or a dot means there. What is globe, space, colon, space, col dot, comma, space, plastic, space, semicolon, space, 30 cm, space, parenthesis, diam dot, parenthesis? Okay. As a human being, I can figure out that the dots there are calling out abbreviations.
Right, because I'm a human being, I'm smart that way. Can I just tell the computer now to assume that a dot always means an abbreviation? Of course I can't. Because that's not what it means in the other fields. And, can anybody tell me why, in the last area of the globe description, everything except the final sentence has a dot at the end? It's enough to make a programmer cry into her beer. Wisconsin, beer's a thing. I left the area labels off here for legibility. But, I just want to point out, we actually have two competing sets of delimiters happening here. The areas, which are set off, delimited, by white space, and what's in each individual area, which is funky punctuation city. And, when you add that to MARC, we get a whole other set of delimiters in the form of fields, subfields, indicators. I respect my colleagues who teach cataloging. Whew, I could never do it. I cut my teeth on XML, where delimiters are totally cut and dried. The validator will tell you if you screw them up. This mismatch is completely bewildering to me. Bottom example there, anybody see an error? I mean, I think there's an error. It's kind of hard to tell. But, that very bottom line, if that's supposed to be an em dash before the word canyon, it actually isn't. Human being, no big deal; to a computer, very big deal. Just goes to show we can't even be sure documentation is right. So, anybody who works with XML now is twitching [inaudible]. Sorry about that. But, it illustrates an important point about delimiters. A lot of my students find learning markup, HTML, XML, really frustrating because they've never had to be 100% consistent about delimiters before. So, they make tiny little mistakes, like leaving off an angle bracket, and they haven't learned to look for those mistakes yet. They don't understand what the validator, which notices the mistakes, is trying to tell them. It's really frustrating for them. So, what I tell them is suck it up and deal. Okay, no, I'm not actually that evil about it. I'm pretty careful to point out the kinds of errors that beginners usually make, and I tell them that everyone makes those errors, even really skilled and experienced people. And, it's okay, the whole point of a validator is to help us get it right. But, fundamentally, yeah, they've got to learn to deal. And, they don't like that, but too bad. ^M00:38:35 [ Foreign language spoken ] ^M00:38:37 It must be done. Don't confuse the computer. Reliable and consistent delimiter use is how we avoid confusing the computer. Delimit in just one way, delimit clearly, delimit unambiguously. And, you know, I have to show this thing that absolutely broke my heart. I was digging into the history of MARC and ISBD internationally, it was just a rabbit hole I fell down. And, it turns out, a while back, the Germans were totally bent on keeping ISBD punctuation out of the MARC record, relying on MARC delimiters only. Which, from the point of view of 2015, totally would have been the right decision. But, how could they know? English-speaking MARC went the ISBD direction instead. And, 20/20 hindsight, what can you do? So, what is it that we've got to do now? ^M00:39:28 [ Foreign language spoken ] ^M00:39:33 We've got to get a handle on our free text issues. When we're saying the same thing more than once, we need to say it the same way every time. We need to atomize our data, make it as granular as it can possibly be. The delimiter thing, sometimes we have too many and sometimes we don't have enough. And, it would be nice to find a happy medium there.
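To picture what atomizing buys you, here is a tiny illustrative comparison: the same physical description once as a jammed-together, punctuation-delimited string and once broken into separate, labeled pieces. The field names are invented for the example; the point is that filtering on one property stops requiring any parsing of punctuation conventions at all.

```python
# Illustration only: one punctuation-delimited string versus atomized,
# labeled pieces. The field names below are invented examples.
jammed = "1 globe : col., plastic ; 30 cm (diam.)"

atomized = {
    "extent_count": 1,
    "carrier_type": "globe",
    "colour": True,
    "material": "plastic",
    "diameter_cm": 30,
}

# With the string, answering "which items are plastic?" means teaching the
# computer ISBD punctuation; with atomized data it is a plain comparison.
items = [atomized]
plastic_items = [item for item in items if item.get("material") == "plastic"]
print(plastic_items)
```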
And, when we can identify something as well as labeling it for our patrons, we should, because identifiers are labels for computers. Identifiers make networked computers happy and useful. And, just to reiterate, we don't have to do these things because linked data. We have to do these things because relational databases, because search engines, because faceted browsing, because internet, because web, basically, because not too bright computers. No secret. Cards on the table here. If we do do these things, we'll be a whole lot closer to linked data. So, it's not like I'm ignoring that. Not at all. I'm just saying linked data is not all that this is about. So, that's great. Those who are in this room that are a bit younger, you've got jobs for life here. I tell my students, this is your career; this migration, this transition is what you'll be spending your career on. Plenty of work. So, how do we actually do it exactly? Good question to have an answer to. By hand, one record at a time, not the answer. It can't be. We've got too many records. I mean, yeah, there are still going to be weird outliers that we end up fixing by hand. There always are. Oh boy, I could tell you stories. I'm sure you could tell me stories too. But, we need to throw computers at fixing the easier problems and limit the hand work to those weird outliers. We've just got too much data to do it any other way. And, because we live with networked computers, I'm not totally sure that doing it one organization at a time is quite the way either, because, in my head, and I lived this with institutional repositories, it kind of means that we're all solving the same problems redundantly in parallel, and we don't need to do that, necessarily. It costs too much and it's going to take too long. But, the truth is, we collectively don't yet have the knowhow that we need to collaborate on this. Most computer programmers don't have the depth of MARC and AACR2 knowledge that they need to fix these problems. And, most catalogers, I won't say most, but, many catalogers don't yet know how to work with a great mass of records as opposed to editing one record at a time. So, this particular problem, this collaboration problem. ^M00:42:58 [ Foreign language spoken ] ^M00:43:00 It is solved by learning. Catalogers, developers, we've all got some learning to do. And I mean, I would say that. Right? I'm a teacher. But, I am a teacher because I believe this. It is solved by learning. Managers in the room, supervisors, please, give your people time and space to learn. Do not make me yell at you about this, because I will. Straight up, your cataloging backlog or that new tech thing that your developers are currently working on is far less important than this strategic preparation for what's coming down the pipeline at you. By all means, hold your people accountable for actually learning. I'm in favor of that. Some people will say they want to learn and then complain endlessly about it; I can tell you stories and you can probably tell me stories too. Hold them accountable, but let them learn. Help them learn. Learn yourself. It won't kill you. It might make you stronger. Here are some tools that I think are well worth adding to your toolkit, if they're not there already, because they're designed to fix stuff in lots of records at once rather than, or possibly in addition to, one record at a time. I've ordered them by the order in which I would recommend that a cataloger learn them. And, the last half, you may not even need.
I mention these because there are situations where they're genuinely going to be useful. Start with MarcEdit, it's the workhorse. It's got an increasing number of really good batch record editing tools. I don't know that I need to say anything about that. Most of you in this room probably already, at least, have heard of this tool. Next, what I recommend is OpenRefine. You may see it called Google Refine as well. There's a variant on OpenRefine that's specifically designed for getting to and from linked data. It's called LODRefine, for linked open data Refine. And, you have to install it anyway, so you might as well install the one that's ready for linked data anyway. Regular expressions are kind of the first nerdy thing I ever learned. So, I'm kind of overfond of them. Regular expressions are search and replace tools that do pattern matching in very, very nifty and sophisticated ways. So, if you're interested in learning regular expressions, I found this wonderful site and I actually have my students using it now. It makes me very happy. RegexOne, a series of very, very nicely presented tutorials getting you used to regular expressions. Strongly, strongly recommend it. So, last three. If you have to extract data from a relational database or put data back into a relational database, Structured Query Language, SQL, is the language that you learn to do that with. We teach a lot of it this semester. It's not, you know, if you can handle MARC and AACR2, you can totally handle SQL. If you have a lot of XML around, there is a little language, if you will, that's designed to take XML and turn it into other stuff, including linked data, if that's your thing. It's called XSLT. The learning curve there is steep. I recommend looking at other people's code to get started. But, I actually love XSLT; I'm weird that way. Last thing, this is relatively new and changing, and I really don't recommend starting here, because here there'll be yaks. And, if you've heard the phrase yak shaving from your developers, yak shaving is all the really annoying work you have to do before you can actually do the work that solves your problem. Right? So, installing Catmandu and Fix, there's a lot of yak shaving there, I will warn you. But, Catmandu and Fix come out of Europe. Fix is a little language aimed at batch editing metadata, really, really interesting little project. I think it will get easier to use. And, therefore, I think it's worth keeping an eye on even if you don't dive into it right this second. So, how should you learn and what should you do with what you learn? In my head, those questions are intimately entwined. In some circles, I'm known for the phrase beating things with rocks until they work. And, if you use rocks for mosaics, this probably won't work, they're too small. But, you get the idea. I think plain old mucking around and breaking things and fixing them once you break them is the best way to learn new tools. It's certainly how I do it. And, it's certainly how I kind of dump my students in the deep end and have them do it. Pick something to do with the tool, then do it. And, if a few rocks get beaten on along the way, it's all good. What should you pick to do with the tools exactly? Well, try cleaning up your catalog data. Right? Fix your 020s with MarcEdit. See if you can at least make those format notes a little more consistent. Export your data to something OpenRefine can read; it's doable. I know catalogers who have done it.
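For a small taste of what that 020 cleanup might look like with regular expressions, here is an illustrative sketch that collapses a few of the qualifier variants of the kind Bill Dueber found, pbk., pbk, paperback, and friends, toward a single controlled value. The patterns and target terms are invented for the example; a real pass would be driven by what is actually sitting in your own catalog.

```python
# Illustrative only: normalize a few 020 qualifier variants with regular
# expressions. The patterns and target vocabulary below are invented examples.
import re

QUALIFIER_RULES = [
    (re.compile(r"\b(pbk|paperback|softcover)\b\.?", re.IGNORECASE), "paperback"),
    (re.compile(r"\b(hbk|hardback|hardcover|hard cover)\b\.?", re.IGNORECASE), "hardback"),
    (re.compile(r"\b(ebook|e-book)\b\.?", re.IGNORECASE), "electronic"),
]

def normalize_qualifier(raw: str) -> str | None:
    """Map a free-text 020 qualifier to one controlled value, if we can."""
    text = raw.strip().strip("()")
    for pattern, value in QUALIFIER_RULES:
        if pattern.search(text):
            return value
    return None  # unrecognized; route it to a human

for q in ["(pbk.)", "(pbk)", "(Paperback)", "(hard cover)", "(alk. paper)"]:
    print(q, "->", normalize_qualifier(q))
```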
See if you can cluster your 504s such that you can start to try and answer the question: which books have bibliographies, which books have indexes? Try fixing your dates, have fun with that. All kinds of fun and interesting problems you'll run into with dates and have to come up with a way to solve. But, boy, will you learn the tool that way. You know your catalogs a lot better than I do. Right? You know where the worst problems are. You know the bits of data you have that you wish the computers understood so that they could answer patron questions better. So, you know where the most important problems to fix are. And, that's crucial knowledge. So, learn the tools by fixing the problems that you already know are there. So, experts to watch. ^M00:49:39 [ Foreign language spoken ] ^M00:49:40 Christina Harlow, doing fabulous work, amazing work. Just published an article on Catmandu and Fix in the Code4Lib Journal. Again, I don't recommend that you dive into that right away. But, if you want to see what real people are doing, Christina's one to watch. Karen Coyle, of course. Diane Hillmann, of course. Owen Stephens, Owen's a Brit doing very interesting things just playing around with data to see what he can do. He's got a nice blog at ostephens.com. I rip him off constantly for stuff for my linked data class. And, of course, OCLC Research, because they're doing this at scale, they have to, and learning very interesting things thereby. A person I would particularly watch is Thom Hickey, who has a blog; I meant to look it up and then I didn't. But, yes, you can find him online. So, you've got a lot of work to do. And, we know it's got to be done. So, let's jump in there and do it. Thank you very much. And, I'm happy to take questions if you have them. ^E00:50:50 ^B00:50:56 >> Mark Winek: Thank you so much, Dorothea. >> Dorothea Salo: Oh, you're welcome. >> Mark Winek: We do have microphones. And, we would ask that you raise your hand if you have a question, so that other people can hear your question. >> Dorothea Salo: Mark, there's one way in the back there. ^E00:51:12 ^B00:51:24 >> Frank Newton: I'm Frank Newton. I am. >> Dorothea Salo: Hi Frank. >> Frank Newton: I am, I wanted to ask you whether computers can entertain hypotheses? And, the reason I'm asking is because it seems to me that that's what human beings do with punctuation. >> Dorothea Salo: Yes. >> Frank Newton: I'm a huge fan of punctuation. And, I think the reason that MARC records have a lot of punctuation in them is because citation standards like MLA and APA have a lot of punctuation in them. And, I think that catalog cards began as an elaboration of a citation format. >> Dorothea Salo: I agree. >> Frank Newton: And, I don't mean to say that while people were developing card catalogs, deciding what information to include and how they would format it on card catalogs, they were constantly saying we want to make this like a citation format. I just think that that was constantly in the back of their minds and not necessarily, not necessarily articulated a lot. So, lots of punctuation, but, by and large, mostly, it seems to me, periods do not have an infinite number of meanings. They have certain meanings that have high probabilities. And, one of them is the one you mentioned about abbreviations.
So, suppose you have a globe and you have metadata for a globe, and in parentheses you have DIAM., then, what I would like the computer to do is to entertain the hypothesis that that period is there because there's a longer word that begins with DIAM that has to do with globes. And, that's the way I believe a human being would operate. >> Dorothea Salo: Yep. >> Frank Newton: And, and we go to our mental dictionaries and we find the word diameter. And, I don't know how exactly we do that. But, I don't completely feel that that's beyond the realm of programming to figure out how we connect DIAM with the word diameter. So, and, that's another, in addition to saying that I like punctuation, I also wanted to say that I think sometimes we assume that things are un-programmable when, in fact, they're perfectly, they're really not that hard to program. And, I would, I wish, I'm not a programmer at all. But, I wish that somebody would program computers to use the singular and plural forms of nouns accurately. So, I don't like it when a computer says to me one books, because I know that, after one, you're supposed to leave off the S at the end of books. And, I don't see why computers couldn't be programmed to understand that too. And. >> Dorothea Salo: The question is, how much money and programmer time are you going to spend asking a computer to act like a human being when computers can't really do that? And, we have thrown an awful lot of money, an awful lot of time at a lot of vendors and a lot of developers, asking them to do this impossible thing, because, no, computers don't get hypotheses. It's yes or no. Right. That's what they get. And so, what you end up doing, and the singular plural problem is relatively easy to solve. That's an easy one. Anybody know about Inera? I can't remember where they're based now. They're a service software company for publishers. And, the reason they exist and the reason they make their money, and they're a very profitable corporation, is that they have written immense amounts of computer code just to parse citations. Right? That's a multimillion dollar industry because computers are so bad at it. Right. So yeah, we can solve these problems. Don't we have better problems to solve? I guess is where I'm at on that. More questions? >> Christine Delany: My name is Christine Delany from American University Library. >> Dorothea Salo: Hi Christine. >> Christine Delany: Hi. And, my question has to do with authority work. >> Dorothea Salo: Yeah. >> Christine Delany: I've heard, received, some mixed information about whether or not, in the linked data world, authority work will be necessary to perform, whether we won't need authority records, etcetera. Can you say a few words about your thoughts about authority, authority work and authority records in a linked data world? >> Dorothea Salo: I think there's going to be gigantically more of it. When I say control all the things, that to me means we have authorities, and more of them. We have more controlled vocabularies. So, authority work in the linked data world is going to be focused a little less on figuring out the labels and more on, okay, we've slapped an identifier on this person who just wrote a book for the first time, or on this place that just became a place. And the authority, the authority work becomes what else do we know that we can attach to this identifier? We know the name for the place. We know the time it was established. We know where else it is in the universe, things like that.
So, authority work will change a little bit; the emphasis is going to change. But, yeah, we still have got to do it. The fun thing is, though, that we don't have to do it individually and in parallel, because that's what the network buys us. As soon as one person knows something interesting about a person represented by a URL, they tell the world. And now, nobody else has to gather that piece of information, because it's out there and we can ask for it. So yes, definitely still authority work, but it will be networked authority work. >> So, I had a question about batch editing subject headings. >> Dorothea Salo: Yeah. >> Because. >> Dorothea Salo: Scary stuff sometimes. >> Because a lot, I mean, you were talking about how other data's inconsistent. But, as a cataloger, I've actually found that subject headings are the most inconsistent, because there's so many terms that can be used for like one piece of information. And, there's only so many that can be put in a record without overwhelming a record. >> Dorothea Salo: Right. >> So, I was wondering if, if there's a bunch of similar records, if an institution can sort of like batch edit those? >> Dorothea Salo: I hope so, because we're going to have to do that a lot, well, not a lot. I don't actually know how much. Let me be clear about this. A lot of this problem arises because subject terminology does change over time. And, sometimes we have time to fix it in our records and sometimes we kind of don't. So, that's just, you know, acute time poverty is one reason that subject headings get into such a state. So, there's some interesting things that can happen with subject headings in a linked data world. And, let me see how I want to frame this. One of them is that, I think, we, I think there are better ways to handle pre-coordination and presenting pre-coordination to patrons than we currently have. So, I think that, actually, once we do the work, that will be a big benefit. Some of it is actually using a subject heading to follow our nose, and not just through library collections but through digital catalogs, or digital collections, through archives, that don't even necessarily need to be using the same subject vocabularies that we use in libraries. That's okay, because vocabulary developers, right, controllers if you will, are working very hard behind the scenes to draw equivalences across vocabularies. And, in the linked data world, that can actually become very seamless. So, the most interesting linked data projects to me, and the ones that I like to show to our students, are the ones that do the best work, not just updating subject headings, not just making them internally consistent, that's going to be a Sisyphean effort, but leading people to the richness of all of the stuff that we have, collectively, as a society, all the information that we have on topics of interest, people of interest, things like that. That's a very rambling answer. And, I don't know if I answered the question. But I tried. >> Mark Winek: Two more over here. >> Dorothea Salo: Okay. >> [Inaudible] currently at George Mason University. >> Dorothea Salo: Hi. >> Cataloging librarian. I think that administrators need to hear what you said today much more than we do, because they're the ones who basically tell us what to do. And, when they're counting how much work we do, they don't say how many errors did you fix today, they want to know how many records have been processed. >> Dorothea Salo: Hear, hear.
>> And, in some cases, I've read that people have been specifically forbidden to make corrections that they see need to be made. >> Dorothea Salo: I have heard this too. It makes me so sad and so angry, because it's just making the work down the line even worse, even harder. >> Yes. Even among the members of the OCLC Cooperative, where the bibliographic records are not your records but our records. >> Dorothea Salo: Oh, don't get me started. I actually called out OCLC on that one in person once. >> Thank you for doing that. >> Dorothea Salo: It was an interesting conversation. >> And, you may be familiar with the Typo of the Day project. >> Dorothea Salo: Yeah. >> Which began ten years ago based on [inaudible] database of records. That project has sort of [inaudible] to a stop but needs to be revived, perhaps not fixing individual words but fixing entire strings. That could be done mechanically. >> Dorothea Salo: Some of it, yeah. >> It's real dangerous to go in and try and fix individual mistakes mechanically. But, if you have a. >> Dorothea Salo: If you have a good testing tool, if you can see what's going to happen before it happens, that's a great thing. >> And having fixed numerous cases of misspelled university, then gone back years later. >> Dorothea Salo: Yep. >> And found that 30 of them have crept back in because vendor records contain them. Vendors are the ones who need to be tending to this. They're. >> Dorothea Salo: Hear, hear. >> On a cash basis. They have no motivation to do anything other than promulgate their wares. >> Dorothea Salo: The least possible, yeah. >> So, these are comments that I have made. And, I think a lot of people who were involved in various projects have retired. New people coming in need to be trained in the fact that these are the kind of activities we need to be engaged in. >> Dorothea Salo: Hear, hear. And, I'm doing my best at the library school where I am, yes. But yes, 100% behind you. >> Virginia Wayen: Hi. >> Dorothea Salo: Hi. >> Virginia Wayen: I'm Virginia Wayen from Frostburg State University. >> Dorothea Salo: Hi Virginia. >> Virginia Wayen: Hi. I like to think that when we're making [inaudible] like this, that we think, for example, not going where Google is but where Google's going. Are you aware of any of the things that are just on the horizon now that we need to be aware of when we do these changes? >> Dorothea Salo: Yeah. One, that's a great question, by the way, and thank you for asking it. The next time you redesign your website, right, when you're redoing all of your templates, the next time you redesign your digital collection, or even the next time you re-skin your catalog, look at something called schema.org microdata. And, you can learn about this at schema.org, perhaps unsurprisingly. So, the bit of history that I didn't get into here is that the web's first fling with what is now linked data, the semantic web, was a complete flop. Right. It didn't happen. And, it didn't happen because RDF is a beast. Right. It's really hard to learn past a certain relatively simple point. It's hard to implement. And, web developers took one look and ran screaming in the opposite direction. I can't blame them. I don't like RDF either. So, this time around, the major search engines, Google, Bing, Yahoo, a couple others, I don't think DuckDuckGo is involved, though, I think they should be. If anybody from DuckDuckGo is in this room, please do that. So, they got together and said okay, we need to break this down and make it simpler.
And so, they came up with a thing called microdata that you can implement directly in HTML, and it's got some very specific vocabularies that you can use to describe things like libraries. The holy trinity of library information, right, address, hours, events, you can put that in microdata and then the search engine can pick it up and show it directly in the search result. If you do a Google search on almost any current movie, you will see much richer information than you used to, because IMDb, the Internet Movie Database, has implemented microdata very effectively and Google picks up on that. In the Code4Lib Journal, and a couple of other places, there are examples of libraries improving their search results, and actually moving their search results up on search results pages, because of microdata. So, the one thing, if you had to pin me down to one thing that's on the horizon that I think is worth looking at, I would say microdata, check it out. Next time you're doing a redesign, see what you can implement. >> Mark Winek: I think we have time for just one more quick question right here. >> Kari Brady: Hi. I'm Kari Brady. I'm from Maryland's Department of Legislative Services. >> Dorothea Salo: Hi Kari. >> Kari Brady: And, I have a question about making more consistent records that describe items that are inconsistent. We, most of our cataloging. >> Dorothea Salo: Wonderful [inaudible]. >> Kari Brady: Yeah. Most of our cataloging is government agency reports and task force reports and all that. >> Dorothea Salo: Yeah. >> Kari Brady: Fun stuff. And, many of them are annual or, you know, at least serialized in some way. And unfortunate. >> Dorothea Salo: Are you serious? >> Kari Brady: Yeah. Unfortunately, because they're not, you know, published by a publisher who wants to be, you know, wants to be found again, so has some stake in being consistent. >> Dorothea Salo: Yeah. >> Kari Brady: They, quite often they change titles very frequently, which isn't as big a deal because you've got all those 246s. >> Dorothea Salo: Right. >> Kari Brady: But, they change formats frequently. Even the laws that require the reports change frequently. And, sometimes, those get to be a real mess as far as tracking those in the record. But then, the biggest problem comes in when they've done something strange, like, well, they skipped a year here but they made it up in this other one. And so, the dates are all messed up. And the patrons are going to be looking for them, but they don't know that they actually split this report up into this report and also this other report that's shelved over here. And so, with some of these records, we end up with a zillion 500 notes. >> Dorothea Salo: Right. Because. >> Kari Brady: And of course they're free text. >> Dorothea Salo: Because MARC doesn't give you any other way to represent that. >> Kari Brady: Exactly. >> Dorothea Salo: Right? >> Kari Brady: And sometimes it's so complicated that it has to be free text, because there's really no. >> Dorothea Salo: How in the world else. >> Kari Brady: Standardized. >> Dorothea Salo: Would you tell even a human being what in the world is going on here? Yeah. >> Kari Brady: Yeah. >> Dorothea Salo: So, explaining things to human beings we shall always have with us. Here's what I'll say about that. Linked data, because it is intrinsically granular, right, you've probably all heard the idea that we're exploding the record and we're making statements.
Part of the reason we're doing that is that the record is kind of a, since I'm on my Greek Latin kick today, procrustean bed. We lop off anything that we have to lop off to make whatever we're describing fit the record. I think linked data will let us stop doing that. And, I think, particularly, for when you have these strange edge cases and for formats that MARC has never dealt real well with, there's a great Code4Lib Journal article from a computer programmer trying to get to grips with how MARC handles music scores. Right? Exactly. I think we will expand into a world that's much kinder to things that don't quite fit the catalog card model that was designed for books. That's my hope anyway, that's my hope. >> Kari Brady: Thanks. >> Dorothea Salo: Sure. Well thank you all. This has been a lot of fun. >> Mark Wiken: Thank you. >> So we're, I think our presentation today is going to compliment Dorothea's actually nicely. We are going to try to help you prepare to be linked. I'm going to give kind of a high level talk and then Jackie will get more down in the dirt and give the real meat of some things you can do. ^E01:10:17 ^B01:10:22 >> Nancy Fallgren: Okay. So, how many of you have been to linked data talks before? How familiar is this slide? I hate Legos. I'm getting this out of the way now. This is the obligatory Legos side. We're done, it's over. No more. Okay? I was just talking to someone about stepping on them and how painful that can be. There are many reasons not to like Legos. So, back when I was a library school student, at the University of Maryland, Dale Flecker from Harvard came to speak to us about the future of libraries and technology. And, the presentation that he gave was full of such doom and gloom that I walked away thinking why am I even bothering to get this degree. And, basically what he, he painted this really dire picture of the inabilities, the inability of libraries to ever catch up with technology because we are so bogged down by our processes we're not agile enough to make any progress. So, what I'd like to say to you is, yes, I'm on the linked data bandwagon because linked data gives me hope. It gives me hope that we can significantly or at least a lot decrease that technology gap. I think we're at a crossroads. I think behind us is the comfort of what we know, a technology that we've been using for 65 years that we feel warm and fuzzy about. And, on the other side, is an opportunity to progress to a more kind technology, along with all the anxiety and discomfort that comes with change and unfamiliar territory. But, if we can embrace this change now and adopt agile practices, then, perhaps, we can actually gain ground on technology. And maybe, moving forward, we can catch up and maybe even keep up. So, MARC is behind me at the crossroad or behind us at the crossroad at our backs and linked data is in front of us. At a high level, a change to linked data is a change from operating in silo databases to working with technology that's aligned with the semantic web. It'll force us to be more aware of how the resources we're working with are related and how our data works on the web. ^E01:12:59 ^B01:13:05 All right. So, I mention semantic web. I really hope, my, one of my dreams is that I will never have to explain this again. I really hope that this slide can be thrown away within the next year. So, what is the semantic web? The semantic web is simply a web of data. It's a more granular web than the one we're used to. 
The web we're used to links documents to documents. The semantic web links data to data. the language of the semantic web is resource description framework or RDF. And, RDF is expressed in triples. Triples look like simple sentence, subject, predicate, object. The idea behind triples is that they should be expressed, or that the data, the pieces of the triples, the subject, the predicate, and the object are preferably expressed as IRI's. And this is a different acronym than the one that Dorothea used. She talked about URI's. Jackie and I are going to say IRI instead. So, what is an IRI? An IRI is an internationalized resource identifier. It's URI that's basically internationalized. So, what does that mean? URI uses a limited character set. IRI uses a universal character set to allow expression in non-Latin scripts. So, the idea is that IRI's can then be round tripped back to URI's and then to other languages. So, it's an internationalization of URI. So, if I am Chinese, I can see URI in Chinese and understand what it's saying, you can see it in my own language. We may go back and forth a little bit, Jackie and I, between IRI and URI, but I think, mostly, mostly you'll be hearing us say IRI. So, we just wanted to make that explanation. And, Jackie will be going into this in detail when she talks to it. One of the really important things to keep in mind, when you're talking about linked data and the semantic web and you're talking about RDF, and this is the concept that I think, sometimes, people or catalogers may have difficulty grasping because we think in terms of records. No more records. We're talking, in RDF, about pieces of data. So, you don't have a record to give each piece of data context. Each triple has to be understandable on its own out in the world. It has to be able to sit out in the world and make sense. No record to give it context. The context of the object of the triple is its subject. Okay? Important concept, really important concept to digest and understand. So, first let's start by, we'll talk about how we can prepare ourselves for linked data. And, I am right up there with Dorothea learn, learn, learn, learn, learn. So, one of the things that we can do in advance of this becoming a standard for us to use, because, it's not a standard. It's not an adopted standard for us to be using right now. Right? We're preparing. We're preparing so that, when we cross that road, it's not, it's not so anxiety filled, we can mitigate some of the anxiety that comes with change with learning. And, when I'm talking about learning, I'm saying explore, buy a book, do it on your own. Look through the web, there are places at W3C where you can learn about linked data. There are, there's information about linked data all over the place. You can take formal linked data training. There is that available. If you're an administrator, I would say to you have discussion groups. Invite your staff to sit down and maybe have a regular brown bag lunch where they can talk about their thoughts their concerns and learn from each other, from each other's learning. Make it an open discussion. And then, if your staff or you, your library can do it, get involved in experimentation of some kind. It doesn't hurt to be playing with things, you can't break anything because none of it is real right now. It's just play. It's all just play right now. One of the things that we need to do is adjust the way we think. Right? So, as I said, we need to think in terms of data rather than records. 
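To make the triple idea concrete, here is a minimal sketch in Python using the rdflib library. The vocabulary and identifiers are invented stand-ins, not any published standard; the point is only that each statement names its subject, predicate, and object with IRIs and can stand on its own, and that an IRI using non-Latin script can be round-tripped to a plain URI.

    # Minimal sketch of RDF triples, with invented example IRIs throughout.
    from rdflib import Graph, Literal, Namespace, URIRef
    from urllib.parse import quote, unquote

    EX = Namespace("http://example.org/vocab/")      # stand-in vocabulary, not a real standard
    book = URIRef("http://example.org/id/book/42")   # the subject is a thing, not a record

    g = Graph()
    # Each (subject, predicate, object) statement is self-contained; no record supplies context.
    g.add((book, EX.title, Literal("Sample Title")))
    g.add((book, EX.author, URIRef("http://example.org/id/person/7")))
    print(g.serialize(format="turtle"))

    # An IRI may use non-Latin scripts; percent-encoding round-trips it to a URI and back.
    iri = "http://example.org/id/主题/42"
    uri = quote(iri, safe=":/")
    assert unquote(uri) == iri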
So, for example, MARC allows us to say that a bib record was created by one entity and then edited or modified by other entities. I hate the 040. But, at a data level, we can't identify. When you look at a MARC record, you have absolutely no idea what piece of data OCLC created, NLM modified, and George Washington University completely replaced. You don't know. All you know is that those institutions touched that record and that you think those institutions do good cataloging work. So, do we really need to know? Do we really need to know exactly what piece of data any particular institution contributed to a MARC record? We don't know it now. Why would we insist on knowing that in a linked data environment? Every time we use a triple, every time we go out on the web and pull something in, and Dorothea's illustration of copy cataloging, going out and pulling in different pieces, every time we pull a piece of data in, we are asserting that that piece of data is an authoritative piece of data, that we believe that that piece of data is true and accurate. So, does it matter where it comes from? Does it matter who originally created it if it is true and accurate, because that's really what we're shooting for in the end, accuracy? So, there are conversion programs out there right now, a couple of conversions of legacy MARC to linked data. That conversion will be lossy. Get used to it. Go with it. Move on. We are not going to have all of MARC in RDF. There's MARC data that just doesn't necessarily make sense. What we need to do, when we're looking at this conversion of legacy MARC, is to think about what do we need? What can we afford to lose? There's a lot of redundant data in MARC. Right? What can we afford to lose and what do we really need to keep and who are we keeping it for? If we've been recording some piece of data for 65 years because some day somebody may want to use it, and no one has or it's been used twice, I say throw it away. I clean out my closet at the end of every season. If I have not worn something that season, it goes to Goodwill. We need to start thinking in those terms. I know it's a little harsh. But, it keeps my closet reasonable. So, some of the best training, almost, that I've ever had was agile development training. The idea is that you assess requirements, you develop a solution, and then you reassess, you decide if the solution still works, and you may revise the solution. And then, you go back and you reassess, revise the solution. It's an iterative process. And, one of the things that we need to think about, if we're going to keep up with technology and we're going to keep moving along, if we take this linked data path and we're going to keep moving forward and we're not going to be left behind the way we've been left behind with MARC, we are never done. The process is never, there is no such thing as finished. You're always iterating. You have to get used to that. You have to start thinking that the data may still all be there, but it might be called something different, might be tagged a little differently, it might show up a little differently, dates may be formatted differently, because five years from now we find out there's a better way to format dates than anybody's been doing before. It's okay. Change is good. Change is okay as long as we keep moving forward. So, I want to talk a little bit about preparing data. And, I think I can cut some of this out because I think Dorothea covered this really well. So, what can we do? What can we do to prepare for linked data?
We have data model development going on. So, what does that mean? We had a question, I'm sorry, I couldn't see who asked the question about serials. So, okay, in an environment where you're looking at every piece of data and how every piece of data relates to another piece of data, you have to think about how do we model that data so that, when you group it, or when you pull pieces of it together, it describes what you need it to describe in a good way, in the way that you want it done? Right? So, you have the title. Serial has title, title. You have serial has publisher, publisher. You have serial changed frequency. Uh oh, what do we do now? Right? Maybe what we do is we model a data event and we say serial had event: change of frequency. And then, we move from there. We have to think about how we can say these things in linked data, the things that we need to say, how do we say them so that they make sense and that the stupid computers and the stupid programs that we use can put it all together and have it make sense to someone else looking at it? And, work like this is going on. NLM has been working with George Washington University, the University of California, Davis, and Zepheira to model a BIBFRAME, a BIBFRAME linked data model for monographs. We started with low hanging fruit and we actually published that back in December. Zepheira, if you are an alumnus of Zepheira training, they have alumni groups that they're putting together now to work on data modeling for specific types of resources. And, anyone can join those. You do have to have gone through training to join, but you also can follow them. Even if you can't join, you can follow. There's also a group that's been put together to talk about modeling for serials. So, there are things out there. And then, of course, there are the LD4 projects, LD4L and LD4P, where part of those projects is looking at how they model data for certain resources. So, if you can't do these things, or be part of these particular things, you can at least follow them. And, locally, at your own institutions, start to think about, start to think about how would I model, serials are a great one, how would I model a serial? How would I make this look? And then, you can contribute what you know to lists, you can contribute it to groups, contribute it however you can. We already have a bunch of RDF ontologies or schemas. There are a lot out there. And, we need to start thinking about which ones do we want to use or which pieces of ones do we want to use. For example, there's one called PROV-O that has really nice modeling for provenance information. Maybe we want to use BIBFRAME and then also use some PROV-O predicates when we're describing something. So, we need to start thinking about that internally. What are we doing within our library? But then, we also need to think about, are we going to have preferred ontologies or preferred schemas for certain kinds of information, certain kinds of data, across libraries, so that we can make sure that we're all talking the same language, or speaking the same language, so that we can all link together. Whose data do you want to use? Whose authoritative data do you want to use?
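One way to picture the "serial had event" modeling Nancy describes is sketched below; the ex: vocabulary and the event shape are invented for illustration and are not drawn from BIBFRAME, PROV-O, or any published serials model.

    # Hypothetical sketch: a serial's change of frequency modeled as an event.
    from rdflib import Graph, Literal, Namespace, URIRef
    from rdflib.namespace import RDF, XSD

    EX = Namespace("http://example.org/vocab/")               # invented vocabulary
    serial = URIRef("http://example.org/id/serial/123")
    event = URIRef("http://example.org/id/event/123-freq-change")

    g = Graph()
    g.add((serial, EX.title, Literal("Example Quarterly")))
    g.add((serial, EX.hadEvent, event))                        # "serial had event"
    g.add((event, RDF.type, EX.FrequencyChange))
    g.add((event, EX.oldFrequency, Literal("quarterly")))
    g.add((event, EX.newFrequency, Literal("monthly")))
    g.add((event, EX.date, Literal("2010", datatype=XSD.gYear)))
    print(g.serialize(format="turtle"))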
I have a colleague at UC Davis whose dream is to have an ISSN, plug it into a form, have that form send it out to the ISSN portal, pick up linked data from the ISSN portal and populate her form. It's a great dream. But, she needs to talk to the people at the ISSN portal if she wants to make it happen, because they have to know that she's interested. So, at NLM we just finished, or we haven't finished, but we are closing out a data project where we published MeSH in RDF. The way that came about was, as part of the BIBFRAME early experimenters, Eric Miller of Zepheira, the first time I met him, sat me down and said I need to talk to you. And we sat over dinner and all he kept telling me were use cases that he knew of where people wanted to use MeSH in RDF. And that was the impetus, that was what got that project started. So, you have to tell people that you want to use their data. They have to know that there's a need for their data in RDF to give them the impetus to change it over so that you can use it. And we need to, we need to expose our authorities. And, I think, Dorothea covered this really well. If you've got a name authority or if you've got something, a local name authority, your professors, your faculty, get those names into, you know, the shared authority files. Get those names into VIAF or ORCID, or ISNI, or whatever, so that your faculty can get out there, so that people can find your faculty, you can link to your faculty, people can link to your faculty. Share that stuff so that those names are out there and they can be used by others. ^E01:28:02 ^B01:28:15 So, when we're recording data, and again, Dorothea hit on this, we need to think about best practices. Here's a PCC, and Jackie's going to talk a little bit about this because she leads the group, but there's a PCC task group on URIs that started, it all really kind of got started when Steven Folsom from Cornell University shared a paper, a white paper, that he had written with some friends about the many ways that identifiers are recorded in MARC and the need for a consistent way to record identifiers in MARC so that we could convert them into actionable IRIs, actionable HTTP IRIs. You could do that. If you see a need, what he did was, he wrote that paper, somebody said to him, you know what, you should bring that to MAC at ALA. And, he brought it to MAC and the folks at MAC said, you know, you really need to bring that to the PCC. So, he brought it to the PCC. And now there's a PCC task group looking into best practices for the creation of IRIs or URIs. If you see a need, get the ball rolling, say it, collaborate, let people speak. Speak. Tell people. If I think we should do this, or this doesn't look to me like it's going to work because it's a string, not an identifier, or there are some different ways to format this identifier, how can we make this better? If you see that need, participate. And, I am so in agreement that we need to deemphasize transcription. I mean, some things you can't help, you've got to transcribe a title, right, it's a title. But, Roman numerals, really? Really? It drives me absolutely insane. I don't think people even know what they mean half the time, forget computers. Right? Really, let's drop the Roman numerals. And then, there are other things like that. Right? There are other things like that.
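As a tiny illustration of the identifier consistency the PCC group is after, the sketch below turns a few of the ways an LCCN-style control number can appear in MARC into one actionable HTTP URI. The normalization rules and the sample values are simplified assumptions for illustration, not the task group's recommendations.

    # Hypothetical normalization of control-number strings into actionable HTTP URIs.
    import re

    def lccn_to_uri(value):
        """Turn strings like '(DLC)n  79021164' or 'n 79021164' into an id.loc.gov URI.
        Simplified for illustration; real LCCN normalization has more cases."""
        cleaned = re.sub(r"^\(DLC\)", "", value.strip())   # drop a leading (DLC) prefix
        cleaned = cleaned.replace(" ", "")                  # collapse internal spaces
        return "http://id.loc.gov/authorities/names/" + cleaned

    for raw in ["(DLC)n  79021164", "n 79021164", "no2015012345"]:
        print(raw, "->", lccn_to_uri(raw))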
If you really look at the MARC data, really take a look and see, do I need to transcribe this or can I do this a different way? Another thing that we can do now is start looking at the linked data pilots and some of the tools. So, the Library of Congress has got a whole bunch of tools, right, that they're putting out and making available for you to use and play with and experiment with. Zepheira has some tools. UC Davis is working on making available a cataloging module that will have the data entry user interface, and you can then see what your data looks like as linked data. There's also RIMMF, which Deborah Fritz has put out. So, there are things out there that you can play with. Play with these things, take a look, see what it looks like. And then, you can actually make suggestions. You can say, you know, I would really like this better if it did this, or this would work better for me if it did that. Right. So, one of the things about Zepheira's tool is that you can actually go out and, if you're looking for an authority or name authority, you can say I want to see the LCNAF, the LC name authority, and it'll give you a little box and it'll show you what the data is. Right now, it's only machine readable. So, one of the things you can tell Zepheira is, hey, you know what, I need this to be human readable because this makes no sense to me right now. This doesn't help me to distinguish it from anything else. You can play with these things, you can make comments, you can help affect how they are developed. So, we can also prepare for discovery. You'll like this one, Dorothea. This is from the Très Riches Heures du Duc de Berry, a 15th century Book of Hours. We need to, we need to seed the semantic web. We need to seed it with authoritative data. Our standard vocabularies, subjects, MeSH, LCSH. Some of this work is already being done. Right. The National Agricultural Library has done this with, and if there's anyone here forgive me, is it [inaudible]? And, you know, name authorities. All of these things, these are all seeds; these are all so that we get these standard vocabularies out there, identified in a standard way, and then we can link to them. But, if we have nothing to link to, if I'm at NLM and I'm cataloging, I'm trying to catalog a book in linked data and we only use MeSH subject headings, and MeSH is not available in RDF, I'm stuck. Right? How's anybody on the web going to find those subject headings in a linked data environment? ^E01:33:13 ^B01:33:17 And, we need to think, this is, I lie awake at night and think about these things. But, with the prospect of, I don't know, hundreds of thousands of libraries, I don't even know how many libraries there are across the world. Right. Exposing their collections as linked data on the web, I think, there's going to be an information explosion. Maybe I'm wrong, maybe there's some way to manage this that I don't understand. But, how will libraries manage that? Right? So, we want our local constituents to be able to find our resources easily. But, not all our users are local. And, we want our collections to be discovered across the world. So, everybody has a copy, not everybody, but most people probably have a copy of Gray's Anatomy, right. That's not one of those things that is unique to any one library. But, Marshall Nirenberg's Nobel Prize is unique to the NLM.
We want people to know that that's there to be able to find it and maybe be able to see it if we've got a photo of it or his collections to be able to find those. Those are the things we really want to stress. Right. So those are the things that I think we need to think about first. Sure, it's easy to model books as Low Hanging Fruit. Serials need to be done too. But, you know what, nearly everybody's got a copy of Jama somewhere. Right? Everybody's got a copy of Nature somewhere. Not everybody has a copy of Marshall Nirenberg's papers. When Diana, so one of the things that I think we need to think about is can we work with the search engine providers? Can we work with them to help get our expose our library data in a way that's manageable that doesn't explode people's brains? When I worked, I was a consultant to the working group on the future of bibliographic control which Deanna Marcum put together. And one of the things she did was to invite Google and Microsoft to the table. And, they agreed to join. They were at every meeting. They were at every, every public presentation that went on. And, behind the scenes they worked with us. Maybe it's time for us to reopen that door. Maybe we need to talk to them and say these are our concerns. Maybe you have a way to manage this. Maybe we can figure out together a way to manage this. Maybe you don't care. Maybe, we can work with them. So, this is, this is a knowledge graph. So, Dorothea talked before about using schema markup, microdata markup. This is a knowledge graph from Google, which, if you look up Marie Curie, this is that thing that shows up on the right side of the screen and gives you some information. And, this is all linked data. This is all coming from Schema. Libraries aren't here anywhere. You've got quotes. Right. You've got a little bit of biographical information quotes and then you've got people also search for. What if, what if it also included resources in libraries and museums? What if there was a way to say, hum, you know there's so many millions of pieces of data coming in from libraries, what if we group it? Oops. And, even better, what if we can group it, because we have such great data, and say I want literature, I can show you literature by Marie Curie, we can show you literature about her. We can show you her letters. We can show you her manuscript archive collections. We can show you images and we can show you film. What if we give them? We can even give them some facets because we do have such great data. Something to think about. So, at the heart of all the stuff that I've been talking about is the ability to link our data. And, ultimately, that can't happen if we don't use IRI's instead of strings wherever we can. So, I'm going to turn this over to Jackie now who is the IRI guru. And, she will talk to you about her experiences with IRI's. ^E01:37:23 ^B01:37:42 >> Jackie Shieh: So, the secret is the identifier. And so, with this, the secret is out. So, my portion, hopefully will not disappoint you is that how TW intended to bring that down to our routine daily operation level. And, as you know, my long journey with electronic resources. I was at UVA playing with TI Header and I was at University of Michigan I was working with Google Books, working with finding things out in copyright and now our print. And now, I'm in GW working with a group of staff to figure out how we could expose our data. 
So, Nancy alluded to the IRI earlier, and this is what my focus is going to be: making sure that we are looking at a script that is usable across different domains and is accessible globally. With that in mind, I think everyone by now has an idea of triples. And, the second I that Nancy alluded to, [inaudible], is the focus of what an RDF statement is: this identifier that is unique. And, it is universal. Now, the thing is, with this IRI, it has to be able to provide linking access through a variety of networks globally. So, essentially, if you were to work with an RDF statement, you can see that an identifier is an integral element of this identification process. So, why do we put so much emphasis on understanding an identifier? And, from this point forward, I'm going to just use the word IRI. It facilitates the exposure, ingest, transmission, and exchange of data across the networks. Now, when it is persistent, unambiguous, and uniform in syntax, and it is consistent over time, it is of high value, because you allow the computer to not only understand the content but also derive new knowledge by reasoning about that content. And, that is the key. In an RDF environment, the process of using URIs shows how this allows the machine to not only understand what is being connected but also have the ability to reason and derive new knowledge from it. And then, that is how the web of data gets created, and the URI is undeniably the hallmark of the whole linked open data process. GW has been fortunate to have been able to participate in the process of different initiatives as an early experimenter and a test implementer, and to experience, with the National Library of Medicine, as they began to contemplate the approach, the modular approach. We were fortunate enough to be at the table and observe and participate. So, how did we get to where we are right now? Simply because, through the grueling process of the two years that we participated in those different initiatives, we went through numerous webinars, study halls, exercises, paired studies that made us realize that we need to begin to have a vision within this. What do I want to do with my profession, to move forward, to be counted in this web of aligned services? How are we going to learn to be able to make use of the skills, the knowledge that we already have, and contribute to the wider world of the knowledge and information sphere? So, throughout this process, we began to have a vision of how we can move together as a team. We learned how to facilitate that [inaudible] through the processes that open source software makes available to us. And hearing Dorothea share the skill set, I must say, I was really comforted. Two years ago, I couldn't have said this in front of all of you. But, now, my team, the only skill that we are in lack of, in need of, is [inaudible]. We have every single one that Dorothea mentioned on her slide. And, it took three years. It took three years of constant learning throughout the process. So, you know, it is not difficult for all of us here to set aside time and communicate to your manager, your supervisor, and say, I have a vision. I'd like to do this. Do I have the time? You give me the time. How can I adjust my workload in order to accomplish that? Because, in the end, you know, ultimately, it's a great contribution to the profession. So, with that, we began to consider how we could continue our experimentation. How can we use this in the MARC environment?
How is it that the RDF linking concept can be expanded in our current ILS system? So, here are some of the questions that, before we started, we had to attempt to answer. IRIs in MARC: with MARC, we encode data in MARC. You would have to decide the field where it goes. And then, you have to know, is there a set field for it? And so, how did this relate to the processes that we had already learned when we were experimenting with different initiatives? In order to accomplish that, what type of workflow and what changes must take place so that we could accomplish this little tiny bit of the process? And so, we began investigating and doing research. So, the first thing we found on the MARC standard side was this instruction. There is the ability for us to play with the data, and, in our case, we didn't really quite understand, when we first began in September 2013, how we could construct those using the form that we're most interested in. And, we did not know what kind of impact that data would have on our Voyager system. So, our focus is on the non-literal surrogate value, which is the HTTP URI, not the text string, because over the years that we've been playing with data, we've been shifting our mindset away from putting in strings, away from thinking of the string as the data. So then, what else? Because we are also an OCLC library, in addition, we need to know what that means in an OCLC environment. And, unfortunately, we can't do anything in OCLC. The guideline basically said we could not do anything with subfield $0. So, that essentially opened up the door for us to experiment locally in our local Voyager system. So, before that, we also began to see if we could locate other types of usage of subfield $0 that were already in circulation. And, most of them, we found, use this format that's very much like the 035 that everyone is aware of. So, we'll put an organization code within parentheses followed by a numeric identifier. You could also use the international standard code that is provided. And, you can find that at the Library of Congress site, with the identifier associated with that particular service. So, along with all of this research that we had gained so far, we began to test our data. And, our favorite example was Ms. Jen Olson's work. So, in 2013, we extracted a bit less than 100 of Jen Olson's works to see how well we could do. We [inaudible] IRIs in the [inaudible]; at the same time we had the string-based identifier, the string-based identifier in the system. We populated the name headings, subject headings and the tracings, which is the 700s area. We loaded about 500 subfield $0s [inaudible] Voyager back in 2013. And to our surprise, and perhaps a sigh of relief, we didn't break the system. It just simply sat there. It just simply sat there. Our web services did not even notice. Nobody shouted out and said what are you doing. So we let it sit for a little bit before we moved forward with it, because we didn't quite understand the impact of an identifier in an IRI HTTP environment. And, we finally settled that, if we were to move forward putting identifiers into Voyager, we would opt to do an HTTP IRI. And, these are some of the reasons that convinced us that is the route to go. And one opportunity came, and I think all of you probably have that in your environment. In August 2014, our Steering Committee, GW is part of the Washington Research Library Consortium.
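A rough sketch of the two subfield $0 shapes Jackie contrasts, the "(organization code)" string form and the actionable HTTP IRI form, added to a 700 field with the pymarc library. The heading and the control numbers are fictitious, and the list-style Field constructor shown here is the older pymarc idiom (current releases expect Subfield objects), so treat it as an illustration rather than drop-in code.

    # Sketch: two forms of subfield $0 on a 700 added entry (fictitious data).
    from pymarc import Record, Field

    record = Record()
    record.add_field(Field(
        tag="700",
        indicators=["1", " "],
        subfields=[
            "a", "Doe, Jane,",
            "d", "1970-",
            # string-style control number: organization code in parentheses plus an identifier
            "0", "(DLC)n  2015012345",
            # actionable HTTP IRI pointing at the same (made-up) authority
            "0", "http://id.loc.gov/authorities/names/n2015012345",
        ],
    ))

    print(record["700"])                      # shows both $0 forms on the field
    print(record["700"].get_subfields("0"))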
One of the initiatives that was set out by the Steering Committee as a 2015 initiative was to develop and implement a plan that would successfully transition libraries' bibliographic records to the new library system platform, including consolidating to a single record, addressing authority control, and the implementation of RDA. Now, one of the very first things identified by the Metadata Committee was something most of you are probably familiar with, the reclamation process with OCLC. So, I don't think I need to go into detail about what reclamation does or is. So, we began our reclamation project in October of 2014, which initiated a process. But, the whole project didn't actually start until January, because we used that roughly three month period to identify strategies, develop a plan, and work out how it would be workable with the limited staff that we have. And also, through all this process, I'd been talking to Terry Reese, because back in August of 2014, he had in mind to begin populating his [inaudible] with additional tools to prepare libraries for linked data. And the first thing he put in that tool was a transformation tool. So, had it not been for [inaudible], we would have had to go through the grueling process of writing code in Perl, which I'm familiar with, utilizing a series of Perl modules like libcurl, the client URL library, and LWP, libwww-perl, to query data against RDF services. So, from March to May, as soon as we got a report back from OCLC, we knew that our data had passed through OCLC. Our next step was that we needed to put our data back into our system. So, at that time, we were thinking, if we were to do that, we could now take this opportunity to enhance our data. The reclamation project, essentially, we were thinking of doing close to two million GW records. But, we have parceled it out in phases. And phase one was almost a million, [inaudible], it was 999,785 records. So, it's a little bit shy of one million records for physical pieces. So that means books, serials, DVDs, CDs, CD-ROMs and all of the above. We left out electronic resources because we were advised to do those differently. So, that's where we are by now. So, with that process, we talked to Terry Reese and worked with him on troubleshooting. And he made available these tools, which are fabulous, fabulous. We ran our records, we selected the subfield $0 sources that we wanted to input. The service that he made available through his linked data lookup, the triple sources that are available, is humongous. But, we just wanted to test to see, if we just had one, what it would do. And, I must tell you, with five dedicated machines, three, plus a troubleshooter, so a fourth dedicated staff member, the months of July and August were sleepless nights for many of us, for many of us there. Now, Terry Reese recently also added a command line process. And, that is tremendously helpful, because I am a command line person. So, I can stack the commands at the prompt, feeding in my 390 files at different times. And so, from the last week of June to the last week of August. The reason why we had to get it done was because of school starting the first week of September. And, we had to make sure all our records were accessible and ready for our students. And, this is what we saw before July 26, I think, before the actual replacement took place. And, we couldn't have done this without our WRLC colleagues standing alongside and being with us.
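In the same spirit as the MarcEdit lookups and the Perl LWP scripts Jackie mentions, here is a rough Python sketch that resolves a heading string to an id.loc.gov URI. It assumes the label-based lookup behavior id.loc.gov has advertised (a redirect whose X-URI header carries the resolved authority URI); verify the endpoint details before relying on them.

    # Rough sketch: resolve a name heading to an id.loc.gov URI via the label lookup.
    # The /label/ path and the X-URI header are assumptions to check against current docs.
    import requests
    from urllib.parse import quote

    def lookup_name_uri(heading):
        url = "http://id.loc.gov/authorities/names/label/" + quote(heading)
        resp = requests.head(url, allow_redirects=False, timeout=10)
        if resp.status_code in (302, 303):
            # the resolved resource URI is expected in a header on the redirect
            return resp.headers.get("X-URI") or resp.headers.get("Location")
        return None   # no match

    print(lookup_name_uri("Austen, Jane, 1775-1817"))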
I think the latest they worked was like 10pm and, I work early hours. So, when I got up at five I saw my files. So, that really worked out very well. Our records right now have gone through RDA, so our records are RDA compliant with subfield $0 HTTP IRIs available. And, as of September 1st, these are our stats. Close to four million subfield $0s in our Voyager data. 99.35% of them resulted from our reclamation project. I think this is really doable. And, you know, it needs a little bit of organizational, you know, talking through with your administrator, talking through with your team. If you really are fretting about the programming side of the activities, talk to your programming IT folks. It doesn't take them that long to come up with a solution. So, from this point forward for us, we can worry less about a lot of the traditional cataloging data management processes, and we can begin to focus on the following: understanding what types of ontologies and services are out there, the vocabularies that allow us to do different types of data modeling so that we can serve our communities, broaden their horizons. We came to understand that linked data ecosystems will enable our researchers to have their research exposed, to have their research be elevated to the level where it is easily found. And, the metadata that we are creating right now reaches beyond our library walls, and the technology that is associated with this will work efficiently with the data. And this is not just library data, this is data across the globe. And, this activity, like Nancy alluded to earlier, has been noted by the PCC. The PCC called a task group together with a fine group of members and an extremely, extremely stellar group of professionals serving as consultants. It's going to take some time for this team to come together making recommendations. But we are making good progress in terms of looking through what's out there, what's being done, and what is feasible. And one thing that is in our minds is that, when we make a recommendation, it will be an incremental recommendation for change. It will not be something so drastic that the rug will be pulled out from under your feet and you have nothing to hold on to. Now, how many of you are Ex Libris customers that are using either Primo or Alma? Oh, not many. Okay. How many of you are Sirsi customers? Okay, now then, this just came out. If you are customers of Ex Libris that are using Alma and Primo, a call has gone out this Monday, okay? SirsiDynix has provided a widget in their BLUEcloud for libraries to generate identifiers associated with the entity that they are creating. Mind you, this is for discovery. What I shared earlier is the raw data that we are providing our users and our IT folks. It's great for Sirsi and Ex Libris, but those don't come down to our local system. They stay with the discovery layer. What I am promoting is for all of us, if you can find a way to get it in your system, whichever it may be, it could be [inaudible], follow the standard. Your library discovery team would love this. Because when we made an announcement internally, by the way, this is the first public forum I'm sharing this in, so you are the first to know. This is not widely known, that GW has almost four million [inaudible] HTTP actionable IRIs in the system.
In fact, when we started, we shared with Library of Congress, Library of Congress has less than 200 [inaudible] embedded and they are not HTTP IRI. So this is really phenomenal for our discovery team. They are writing programs because they, this has just happened like last month. So we are waiting to see how this will impact our service. But as I'm sharing earlier, when we tested it in 2013, nothing happened. Okay? So you will not break your system, I don't think. So, you know, test it. And then let it be known that there is this subset, there's this pilot, and see what your discovery team can do with it. And I must say that GW wasn't the single institution in this country testing it. They have many people testing it. They are just probably the first one to put it in production. And this is our retrospective process for reclamation, but for the ongoing, incoming title, we have, we are running through marc edit, [inaudible]. Now I know for sure, European libraries are doing this locally. It's just right now, simply because [inaudible] zero, the URI, the IRI, the format you use in marc environment hasn't been settled, so a lot of people are asked and they're not sure how to contribute this data back into a utility such as OCLC and like I shared earlier, you see OCLC said no to [inaudible] zero. So a lot of these things are all happening locally, Swedish Library, the National Library of Spain, German National Library, British Library, these are the libraries that we should be keeping tabs on what they are doing locally. And soon enough, hopefully all of this data in individual systems will benefit all of us at large. So I'm going to leave you this, that I love, there is a tide in the affairs of men, which taken at the flood, leads on to fortune. Omitted all the voyage of their life is bound in shallows and in miseries. On such a full sea are we now afloat, and we must take the current where it serves or lose our fortune, ventures, sorry. This is from Shakespeare, The Tragedy of Julius Caesar. Thank you. ^E02:05:31 ^B02:05:38 >> Mark Winek: Thank you to both, thank you to both Nancy and Jackie. And I'd invite both of them to come up. I think we have time for about two questions. I don't want to. >> Jackie Shieh: Sorry. >> Mark Winek: That's okay. >> Jackie Shieh: Sorry. >> Mark Winek: Two or three questions and then we can run into lunch. So I'll start over there with Chris. >> Carla: Hey, Jackie. >> Jackie Shieh: Hi, Carla. >> Carla: I remember one year ago, we both talked to a global scholar, Dr. Acharya, who is the founder of the Google Scholar. >> Jackie Shieh: The Google Scholar. >> Carla: Jackie, yeah, we lobby him to help library to come up with more meaningful and indexing way in the Google Scholar so that make library resource available in that [inaudible] platform. One thing I remember he mentioned about is the question of a URI? >> Jackie Shieh: Yeah. >> Carla: Yeah. You mentioned about IRI, do you think that did not answer his question, that one day everything would just go to one unique identifier so that like it will not be mixed with, if our library say subject heading on name had a [inaudible] linked to Wikipedia, one day they can really talk to each other with a unique [inaudible]. Is that possible? >> Jackie Shieh: I think right now you can see that example in [inaudible] that is in abrogated services collecting all the identifiers established by multiple national library services for one single entity in one place. 
So how it will essentially play out is where, I think Dorothea mentioned, you know, we need to figure out what would be the authoritative service that we will promote and use. Now, when I say the word IRI, I mean that you use a script; you notice that the script was different when you were using Unicode versus non-Unicode. So that's for the machine to translate. If you were to use an IRI with the same value, and the newer system cannot read the IRI, it will translate it; within this environment, we are expecting the machine to work smarter. I'm sorry, Dorothea. We have expectations that a machine should usually be smarter, so that it would take our IRI, if your local system cannot ingest the IRI, and get it back into the URI, so you can still get to where you need to go. >> Carla: Okay. I just noticed in your record in [inaudible] there are two subfield zeros. >> Jackie Shieh: In 2013, we had actually three, or four, or five, because we were testing. >> Carla: Okay. So each of them. >> Jackie Shieh: They are all the same, they all lead to the same authority data, it's just they're formed differently; they are HTTP versus string, that is, using [inaudible] and then the alphanumeric identifier value. >> Carla: Okay, got it. Last question. Okay. I'm sorry. Maybe I'll ask you later. >> Mark Winek: We have a question over here. >> Hi, Jackie. >> Hi. >> Hi. Earlier you mentioned that you had chosen not to add IRIs, or at least not through MarcEdit, for your e-resources because something else was recommended. Could you describe that? >> Jackie Shieh: Yes. I think because we have had such composite practices over the years for electronic resources, and it's rather difficult, maybe it's easy for you, but it was quite difficult for my team to really figure out, because, okay, back to the question of are we talking about ISBNs: we use ISBNs, and when we are searching for one ISBN we then realize, oh, this is actually the print ISBN, but we brought up the electronic resource record. So then we actually have multiple records and then we have to go deeper to figure it out. Is this the right ISBN for the right e-resource? So that's why we decided to hold off on the electronic resources through this process. However, we were recommended to use Collection Manager in WMS to set our holdings, which our ER team uses to set GW's electronic resources subscriptions. But our recent discovery, and I can share with you that our master matcher, Sherry Chang, is in the audience, she uses SQL to query all this data and, you know, lead us through this process. We usually spend a month or two looking at that data before we decide what to do. So right now we are still in an investigation process. And so what we are doing is, we've discovered Collection Manager says we have this many titles selected of this many in this set. Then we check with [inaudible] solutions, where we have more; then, when we requested the MARC records to examine from OCLC's Collection Manager, we actually got fewer. So OCLC said we had 311, and we actually got only 289 MARC records. But in our local system we already had 396. And [inaudible] solutions say we have 615. So we say, what do we actually have? So I went to the vendor, and the vendor says no, you have 622. So, right now, we are just pulling our hair out trying to deal with these electronic resources. So I'm so glad that we forwent electronic resources when we first did our reclamation, otherwise we wouldn't be where we are right now, and I wouldn't be sharing this with you, so.
>> Mark Winek: I think we have time for one more question. Is there anyone else? ^E02:12:32 ^B02:12:39 >> I wondered if you talked to any authority vendors, Jackie, so that we all don't have to reinvent the wheel and do what you're doing. >> Jackie Shieh: You know, interesting thing, I think that they will be our business. I'm sorry. No, one of the things that we, a serendipitous discovery for us, Sherry Chang found that and I didn't even realize it. You know when we, GW's activities all tied in, all of our services tied in with our voyager. But we are so lean, we, our records, a lot of our records weren't touched for the last 20 years, okay? So a lot of our headings are so out of date, we didn't know what to do because we simply don't have the manpower or we don't have that attention span, so to speak, to tackle this huge problem. So I was talking to Terry Reese, and you know, [inaudible], which it's really interesting I remember when I was querying using LWP, querying Amazon.com servers for in copyright out of print title. So when you query data in SQL, you can get back not just your 100, your authorized heading, you could get back all the references. So what happened is our heading was old, so when we send in for query it actually found in the not the authorized heading, but it got back the authorized heading to our system. So what was brought to my attention was because we rely on a lot of gifts, we have donor informations, and we have to identify donors, main entry as donor, and so when we're doing this second phase of our reclamation project, it was too hard to insert back the field saying so and so is a donor. Okay? So we recognize this particular donor and his name was different. So how, we said to ourselves how in the world did this happen? We didn't make any changes on his heading. His name came out with a JR, the other one was just his name and his year, but now he's JR, yes indeed, inserted in between. And so we was like, let's see what else happened here. So that was our experience. Now we never talked to authority vendors because we are not in a position to talk to, you know, authority vendors for our process. So a lot of this process is, you know, we're still in testing mode. We do not know what the future holds. We just know that let's give it a go and see where we are. And if this is total disastrous, fortunately it was okay with our AUL. That was the thing. Our AUL okay it and we, before we actually started on July 26, we actually froze our, we have a snapshot of our voyager data cut off at July 25. So if we broke the system, we can rebuild it. We have a safety net. So fortunately, we didn't need to do that. >> Mark Winek: And so our final speaker for today is Linda Wen, who joins us from the Washington College of Law at American University where she is the head of collections and bibliographic services. Her previous professional experiences includes a term as the head of IT at the University of Arkansas, Little Rock, she has significant research experience in e-resources meta data, technical services work flows and integrated library systems. And welcome, Linda. ^E02:16:47 ^B02:17:44 >> Linda Wen: Okay, can you hear me? Good. Okay. Let's get started and get it over with. I have been quite nervous all this morning. And the beauty about the last speaker is I don't have much to add. A lot of this has already been covered already this morning, so I'll be fast. 
Again, my name is Linda Wen; I'm sorry that, on the speaker description, I did not mention that I'm from the Washington College of Law, American University. Okay, we will talk about linked data today. ^E02:18:34 ^B02:18:40 Linked data is a network of RDF statements expressed using published vocabularies. Each statement, usually referred to as a triple, consists of a subject, predicate and object. Or, in other words, the entity, the property or relationship, and the property value. And so, we have a triple statement like a person is the author of the work, or the work is of the genre science fiction. And when we talk about linked data, we talk about URIs, and the subject, predicate and object usually are identified by a URI and resolve to an authoritative vocabulary. And all these entities and triples can be interconnected with each other. And so every entity will have a bunch of properties circled around it, and then we will have an entity graph. So it looks like this. So, for a computer, it doesn't know what a Mona Lisa is, it's just a string. But with all the IDs and property relationships around it, this name becomes a thing, alive, so the computer gets to know what it is about. ^E02:20:54 ^B02:21:03 So the string becomes a thing; for the computers, the things will have a URI which points to an authoritative vocabulary. So this kind of resembles what we librarians do as authority control, except that the authority vocabulary is all identified by URIs. So the libraries have been busy identifying their entities, with all the properties around them, and they interlink with each other. So we will have these graphs, connected with each other. ^E02:22:01 ^B02:22:05 And ideally, when all the authoritative vocabulary hubs are connected to each other, we will have this famous linked data cloud. And we can take a look. ^E02:22:25 ^B02:22:54 So this is the site of the Linking Open Data cloud, and if I click on it, I can see the details. ^E02:23:08 ^B02:23:14 And I want to take a look at what the libraries have been doing. ^E02:23:21 ^B02:23:26 We'll see the center of the cloud is DBpedia. And we will see some very famous names among the library and publisher community. Let's see. I think it's better this way. Okay, the green section consists of the hubs contributed by the library and publisher communities. That takes about, let's see, 20% of the cloud. So that's where we are now. ^E02:24:17 ^B02:24:38 And outside the library community there are some famous example linked data players, like the Google Knowledge Card. I do a search, To Kill a Mockingbird, and we will see a list on the left of the links to the documents. And the right hand side is more of a graph, or more of an entity relationship. And Google has already answered some of the user's questions. And that also tells us that even Google goes through this coexisting path. On one side is the traditional and on the other side it is trying to do the linked data model. And a big player in this Google Knowledge Graph is Wikipedia. ^E02:25:59 ^B02:26:08 Okay. We will see that later. ^E02:26:10 ^B02:26:17 And if I do a Wikipedia search, one thing I want to point out is that we have this link to OCLC. ^E02:26:32 ^B02:26:42 I mean, a link to the OCLC record, right here. ^E02:26:46 ^B02:26:50 And towards the end is the linked data that OCLC has already published.
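A small illustration of a string becoming a "thing" behind a URI: the sketch below dereferences the DBpedia identifier for Mona Lisa and lists a few of the statements that come back. It assumes DBpedia's usual content negotiation is available and that the network is reachable, so results will vary.

    # Sketch: dereference a URI to get machine-readable statements about the entity.
    from rdflib import Graph, URIRef

    thing = URIRef("http://dbpedia.org/resource/Mona_Lisa")
    g = Graph()
    g.parse(str(thing))        # rdflib negotiates an RDF serialization and follows redirects
    print(len(g), "triples describe", thing)
    for _, p, o in list(g.triples((thing, None, None)))[:5]:
        print(p, "->", o)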
So in every OCLC record, whatever search you do, you will see, towards the end, the linked data, and on this page we'll see that we have the predicates and objects listed and some of the authoritative hub names listed. For example, schema.org. ^E02:27:32 ^B02:27:41 Okay, and another example I want to talk about is the BBC. ^E02:27:47 ^B02:28:01 The BBC is already publishing its vocabulary, an ontology, and has also populated all their websites. And this played a big role during the Olympic Games. And so every athlete, every team, has its own website generated with all the dynamic games and statistics listed. And we can take a look at the mapping of the BBC's ontology. And at the top, in the prefixes, we see some of the big names of the authoritative ontologies. ^E02:28:56 ^B02:29:08 And in the middle, what I want to show everybody is, take a look at the core concepts ontology. And there is a graph of how the BBC site is structured. And on the right you see the entities: person, place, events, organizations and things. Some of these overlap with what we're talking about in the library. ^E02:29:47 ^B02:29:53 So, okay, we were talking this morning about linked data graphs, and the change in the data: for example, the graph on the left resembles what we have as a record, and on the right hand side, with all the URIs, the record has become puzzle pieces. The puzzle pieces can then reconnect with other puzzle pieces and become other websites or another group of knowledge communities. So that gives us a bunch of benefits of linked data. It's easier to get updated. There are fewer human errors with dynamic updating. And also maybe less application cost. ^E02:31:09 ^B02:31:20 I listed here some of the early players in linked data vocabularies. And if you take a look at any linked data example, any record example, you'll see these links, you'll see these names. And the BIBFRAME vocabulary is the one we know, the one we are most familiar with. And let's take a look at the category view, which is more for the catalogers. And probably you will find, like, a journal title being absorbed by another journal title, those kinds of properties, in the Library of Congress BIBFRAME initiative, right? They welcome cataloging librarians' feedback. ^E02:32:21 ^B02:32:37 I think, by now, probably, everybody has seen this image. It's talking about how to prepare the data. Is your data 5 star? Which means, is your data out on the web? And is your data machine processable? For example, if it's an image or PDF file, it's not machine processable; it's better as a CSV file. And your data, is it confined in your ILS, rather than having all the links which can be queried outside of the library community? So, library and data entities: this image is from OCLC, so OCLC's focus is on the key entities, persons, places, organizations, concepts, works and events. And taking a look at the data, what we have in the library, I think we can categorize it into these four groups: library authority files, labeled names, and semi-structured and unstructured text. As I mentioned before, data with URIs resolve to vocabularies. The process, or the concept, resembles what we have been doing, so, authority control. ^E02:34:44 ^B02:34:51 So the focus, or the center of the easy part, of all those four categories, the authority file probably is the concentration of our work, because it already has these links to authority control records. So I would like to talk about several examples.
So at the top, inside one authority record we are familiar with, is the 010 field. Both the Library of Congress and OCLC have been using these, like ISBN numbers, LCCN numbers, and these have already been transformed into URIs. And what about the rest of this? For example, the 150 is put in the 650 of the bib record and so it has a link. If we can apply an identifier to the 150, then the 450s, the unauthorized terms, and the 550s are easier, because with the subfields $w and $g, we can use the [inaudible] or SKOS broader property to identify those. So, for the concepts or subject headings, we can deal with those authority records and make them URIs. But what about the name, person, authority records? I have one example for Stephen King here. Take a look at the bottom, in the SKOS URI. First, it identifies the person name as a concept. And for other languages, translations of the name, it gives an alt label, an [inaudible], also for pen names. So in the authority record, where it gives the URI, the URI doesn't tell you that the main label and the rest of them are actually properties of the person. So it's different from the subject headings. And I think OCLC has been thinking about this, so it's trying to use the Friend of a Friend, FOAF, standard to deal with this kind of problem. So, still at the bottom, you see that, in the FOAF standard, a person's name is in the person label, and then the rest of them, including the pen name, other versions, or other languages of the name, are the properties. ^E02:39:25 ^B02:39:31 Other libraries, the big library communities, have been busy converting their library data to linked data. Let's take a look at the British National Bibliography. ^E02:39:48 ^B02:40:05 Okay, so here, they list some examples of what you can do with the linked data. Take a look at the sample queries; the range of possibilities is endless. So say I'm interested in the 50 authors born in 1945, or I can put my own labels, my own query code, down here, and I will have ^E02:40:48 ^B02:40:53 the result here, a list of all the authors born in that year. And because the birth year and the author's name and all the author's works are all different, they become different identities, puzzle pieces. So I can sample them and package them as different results. ^E02:41:20 ^B02:41:29 And one thing I want to mention about Oslo Public Library is that that library has already dropped MARC, so if you're interested, take a look at their catalog and see how they're doing. And also, you can query their data. ^E02:41:50 ^B02:41:56 Whoa, we are having this meeting in the Library of Congress, so we are all very familiar with BIBFRAME. ^E02:42:04 ^B02:42:09 BIBFRAME, I've been thinking about this: BIBFRAME divides the records up, so BIBFRAME is like giving you all these buckets for the different fields. So for the work entity, I am thinking that these records can belong to that bucket. And for instances
The comparison tool, ^E02:44:26 ^B02:44:30 is quite easy to use; you just need to use the Library of Congress record number. ^E02:44:40 ^B02:44:44 It is not the regular 001 record number. ^E02:44:50 ^B02:44:57 So we talked about the comparison tool, and you also have the MARC-to-BIBFRAME transformation service and the BIBFRAME editor. Maybe let's take a brief look at the editor. Here is a demonstration site, so you do not even have to register; you can just play with it. Okay, say I have a copy of some work, or a copy of an instance of a work, and this is like copy cataloging: if I am looking for, say, "sea level rising," ^E02:45:51 ^B02:45:57 it will give you a list of all the records that already exist at the Library of Congress. It is just not working today. I have been experimenting with it, and I would rather look things up by the LCCN than by this keyword search. ^E02:46:23 ^B02:46:30 So, let's see. ^E02:46:33 ^B02:46:40 It is not working. When you are done, you can download the record and take a look, so let's see. ^E02:46:53 ^B02:46:59 At the bottom of that list is MARCNext, which Jackie talked about this morning. It is the new function in MarcEdit. ^E02:47:11 ^B02:47:15 This looks quite familiar. I could demonstrate it on my computer, but I have several slides. If I click on MARCNext, I have this list of tools. ^E02:47:33 ^B02:47:37 I will first talk about the testbed, which I used to try things out. For example, I was interested in how that "sea level rising" book looks in BIBFRAME format, so I downloaded the record, clicked on the testbed, and specified the base URL as either local or another authoritative ontology URL. Then, when I press the button, it says the file has been processed. It is quite straightforward. When I open it up, I have the MARC record transformed into the new format. ^E02:48:46 ^B02:48:51 Okay. The link identifier: this is a very handy tool that you can already be using. You can embed it in the regular workflow, like Jackie talked about this morning. Say you have a MARC record and you want to insert all the subfield $0s from different sources. Right now they have the Library of Congress [inaudible], and the pull-down list is very long. I do not think every one of them is working now, but these four listed here are working, so it depends on which one you trust. After I process the record, the subfield $0s are inserted. If you also want to do Jackie's project in your library, you can try using MARCNext; it is quite straightforward, just one, two, three steps. This is the original field in the MARC record, and by comparison you can see the subfield $0s inserted automatically. Okay. As I said, [inaudible] I am a learning cataloger, so I want to get my feet wet, half wet maybe, so I would like to recommend one tool, a very important tool: SPARQL. What SQL is to a regular database, SPARQL is to linked data. ^E02:51:09 ^B02:51:14 SPARQL queries across databases and pulls structured or semi-structured data; it can even explore data by querying unknown relationships, and it can transform RDF data from one vocabulary to another. So it has all the good functions; it is designed for linked data. ^E02:51:45 ^B02:51:52 This is the DBpedia site. Everybody can go there and query DBpedia, which is the structured-data version of Wikipedia. You can query whatever triples they have there. Maybe let's try one; you can see what this query looks like in script form in the sketch just below. Let's see.
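For anyone who would rather run this kind of query from a script than from the web form, here is a minimal sketch using the Python SPARQLWrapper library against DBpedia's public SPARQL endpoint; it mirrors the query demonstrated next, listing up to 100 predicate and object pairs for the Library of Congress resource.

    # Minimal sketch: query DBpedia's public endpoint for all predicates and objects
    # attached to the Library of Congress resource, limited to 100 results.
    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("https://dbpedia.org/sparql")
    sparql.setReturnFormat(JSON)
    sparql.setQuery("""
        PREFIX dbr: <http://dbpedia.org/resource/>
        SELECT ?p ?o
        WHERE { dbr:Library_of_Congress ?p ?o }
        LIMIT 100
    """)

    results = sparql.query().convert()
    for row in results["results"]["bindings"]:
        print(row["p"]["value"], row["o"]["value"])

Swapping dbr:Library_of_Congress for another resource name repeats the "replace the subject" experiment described in the demonstration that follows.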
So I am trying, let's see, I want to know all the predicates and objects for the Library of Congress. Of course, I will limit it to, say, 100; otherwise it is endless. So let's take a look. ^E02:52:46 ^B02:52:51 Oh, right here. ^E02:52:53 ^B02:53:56 Okay. Because the Library of Congress is in Wikipedia, I can query everything about the Library of Congress behind the scenes. So I come here to the DBpedia site, I write a line of very simple code, and I get 100 statements there. As we mentioned before, all the linked data can be recombined and packaged, so the possibilities are endless; you can query whatever you like. And let's see. Say I am on Wikipedia and maybe I am interested in that. ^E02:54:17 ^B02:54:24 Let's see, I can ^E02:54:26 ^B02:54:40 copy this. They have been using that piece of information in the URL; they have actually been using it as a URI. So let me try to replace "Library of Congress" with it and try that. Oops. See what I get. ^E02:55:10 ^B02:55:27 [Inaudible] got the information for this, and so it just comes out. ^E02:55:35 ^B02:55:47 Okay, so probably instead of you asking me questions, I would like to ask you questions; that is easier. I have been thinking about future descriptive practice, or the future cataloging workflow. It will be very different. For example, crowdsourcing: because we have all this library experience, working in the library, we have all this expertise, and compared with the level of detail we have in MARC cataloging, we have to think about how detailed the linked data will be. So maybe we take a coexisting dual path: we follow what the big names, like OCLC or the Library of Congress, have been doing, and we, as individual librarians, also contribute to crowdsourcing the description of library resources, which consist of the most prominent creative works in the world. An example is the New York Public Library. They have this Linked Jazz project, crowdsourcing with regular people, users, and librarians labeling all the jazz pieces so that the system, the computer, can query them and assign URIs to that information. ^E02:58:09 ^B02:58:18 Thinking of the future, there are more questions, like digital or published information: what do we do with that? For example, with OCLC, if we take a look at the end of every record, there is a linked data section, and you will see schema.org. So we are somewhat dependent on authoritative hubs from other communities, and it is like a social contract: how much can we trust them? That is about what I have to say. Thank you. ^E02:59:20 ^B02:59:28 >> Mark Winek: Thank you very much, Linda. Before we start our question and answer period, we just wanted to note that you still have an opportunity to vote. We noticed a discrepancy between the number of ballots and the number of registrants. So you still have an opportunity to vote until we announce the results; if you need a ballot or have not submitted your ballot, please raise your hand and someone will come around to you and pick it up. ^E02:59:59 ^B03:00:05 This isn't Chicago, so no voting early and often, just early. ^E03:00:10 ^B03:00:14 So, I would like to open the floor, if Linda will accept questions from the audience. Please just raise your hand. ^E03:00:23 ^B03:00:36 >> Kim Kiley: Hello, Linda. >> Linda Wen: Hi. >> Kim Kiley: Kim Kiley.
I'm from the Library of the United States Supreme Court, and I was just wondering if you think there are any particular or special challenges in law librarianship when we think about linked data? >> Linda Wen: About law librarianship, very good question. As I said, I am a learning cataloger and I am kind of new to the law library. Law libraries are very heavy with periodicals, with things like embargo periods and name changes, and I think that will be very challenging. Also, periodicals are moving to electronic formats more and more, and then we are talking about ISBNs, eISBNs, and all those challenges. From a subject heading point of view I do not have much to say, but from the system or technology point of view, I would like to say that I trust the ability of the technology. For example, in BIBFRAME and the other ontologies out there, the properties and relationships, like "is author of" or "resolved by," are expanding dramatically, and we librarians can contribute our point of view to take care of those details. So I would like to say I trust the technology. >> Kim Kiley: Thank you. ^E03:02:43 ^B03:02:50 >> Mark Winek: Does anyone have any other questions? ^E03:02:53 ^B03:02:56 >> Yes. You said something about what new skills catalogers need in the new environment, but you only touched on SPARQL. Can you elaborate on the other skills? >> Linda Wen: The reason I listed just the one is so that we are not all overwhelmed. I highly recommend SPARQL because of all the functions that SPARQL offers: it can query triples, and across different modules. For other skills, I think XML, ^E03:03:45 ^B03:03:51 yeah, and scripting languages; I think that is about it, from my point of view. And at conferences, I have heard people who create applications or tools all mention that they want to attract more people to creating tools. In the library community we have a lack of tools, a lack of people who are creating tools. For example, if I go to the BIBFRAME testbed editor and I want to look up subject headings, I have this list of headings, but what if I want other authoritative hubs, if I want to query other authoritative sources? What do I do? That could be a project: people could create tools and embed them there so the editor would connect to other communities, other groups or hubs of authoritative vocabulary. ^E03:05:20 ^B03:05:28 >> Mark Winek: Any other questions? Thank you all, and thank you so very much, Linda. >> Linda Wen: Thank you. ^E03:05:35 ^B03:05:59 >> Mark Winek: So thank you very much, and thanks to all of our speakers today. And now for a tradition at the PTPL Annual Meeting, and for her last duty on the PTPL Advisory Board, I would like to present Linda Geisler, who will present the Historical Moment. ^E03:06:18 ^B03:06:22 >> Linda Geisler: Hi. As Mark announced, this will be the last duty, I guess, that I perform for PTPL. We are coming to the end of the 91st meeting of the Potomac Technical Processing Librarians, and I am happy to provide just a brief historical moment. While I was thinking about what to select, I did some searching on the PTPL website and I actually found a list of all the members who have served as chair of PTPL, and since we are here at the Library of Congress today, I was wondering, well, how often or how actively have Library of Congress staff been involved in this organization.
And I was pleasantly surprised to find that there have actually been nine chairs from the Library of Congress in PTPL. In fact, the very second chair of PTPL was Charles Martel; I believe that was in 1924. Many of you may know him. He is quite well known because he was the chief classifier, and when the decision was made to develop a Library of Congress classification, he oversaw the development of the new system, based on the existing million-volume collection that the Library of Congress had in 1897. Over the years, we have had other chairs. From 1935 to 36 it was Theodore Mueller, and from 1948 to 49 it was Werner Ellinger. Other chairs include Robert R. Holmes, 1963 to 64, and Alice F. Toomey, from 1969 to 70. She was actually a well-known reference librarian who worked on a World Bibliography of Bibliographies covering 1964 through 1974. Another member was Arthur Lieb, who was very active in our prints and photographs collection. We actually have three who were chair during what I would call our recent thirty-year period, and some of them still work here at the Library of Congress. One of them is Andrew Lisowski, who was the chair from 1985 to 86, and he works in automation with our [inaudible] program office. I saw that Vera Clyburn was here today; she was our chair in 2011, and she is the current chief of our U.S. Arts, Sciences and Humanities Division. And, of course, I am the last one for right now. But I am sure, from this history, that we will continue to have many Library of Congress staff involved in PTPL. As was a recurring theme at the meeting today, education, professional development, and training are a very important part of our profession, and I think PTPL does a really great job of providing that opportunity for all of us. It also enables us to be involved locally with other colleagues all over the metropolitan area, Maryland, and D.C., and to network with all different types of libraries: academic, federal, state, private, public. I have found that it is a very convenient way to maintain currency and learn about changes in the field as they relate to technical processing. Anyway, with that very positive thought, I am certainly hopeful that the Library of Congress will continue its relationship with PTPL. Thank you. ^E03:10:03 ^B03:10:06 >> This has been a presentation of the Library of Congress. Visit us at LOC.gov. ^E03:10:13