Data Curation…Crisis? - Ep 213

E213 · The ArchaeoTech Podcast

Recent discussions with colleagues and the February 2024 issue of Advances in Archaeological Practice had Paul thinking about what we do with our digital data. This is an evergreen topic, and one that we’ve touched on before, but is always good to revisit.

Transcripts

  • For rough transcripts of this episode go to https://www.archpodnet.com/archaeotech/213

Transcript

Introduction and Podcast Update

00:00:01
Speaker
You're listening to the Archaeology Podcast Network. Hello and welcome to the ArchaeoTech Podcast, Episode 213. I'm your host, Chris Webster, with my co-host, Paul Zimmerman. Today we talk about data curation in the context of the February 2024 thematic issue of Advances in Archaeological Practice. Let's get to it.
00:00:26
Speaker
Welcome to the show, everyone. Paul, how's it going? Welcome. Welcome to your first episode of the ArchaeoTech Podcast. Oh, wait, that's not it. It's just the first one in a while.
00:00:35
Speaker
Yeah. Well, no, we did record the... Oh, that wasn't just ArchaeoTech. It's been forever since we've done just an ArchaeoTech. Yeah. Wow. Thank you. Good to be back. I know. It's great. Yeah. We just basically took a pause. I think I mentioned this a little bit on the last episode that you guys heard, which would have been a crossover with The Archaeology Show, because the topic seemed to fit. So Paul did that interview with me, as you guys heard. And
00:01:00
Speaker
we, uh, cross posted that on TAS and then kind of called it good. But this is our first, like you said, real episode since I don't know, November, something like that early December. Yeah. Yeah. Yeah. I mean, I was out in the field for a couple months and I've been back for a couple months now, but geez, we just have not been able to coordinate my Mondays for some reason are, uh, are just oddly busy. Yeah. And we actually, in fact, I don't even know if you can make this one, but we have an interview next week. It's in my calendar.
00:01:30
Speaker
It's in your calendar. Awesome. So the next episode you guys hear will be an interview, and that'll be good. So we're looking forward to that.

Thematic Issues in Data Curation

00:01:38
Speaker
All right. Well, something we have touched on in the past, and it always comes up when you're talking about big data and data sets and all kinds of different things, is data curation. And Paul, why don't you introduce this, because you brought my attention to this latest Advances in Archaeological Practice
00:01:57
Speaker
issue, February 2024, which is linked in the show notes, by the way. So if you want to go take a look at that, it's completely open access. So you can go look at any of the articles, download them, do whatever you want to do and take a look. But Paul, why don't you give us a little bit of an introduction to the topic? Yeah. So this issue of AAP is a thematic issue and it's around data curation. And, you know, there were a couple of articles in it that I thought really stood out, though the whole thing, if you're into this and you probably should be, if you're collecting data, it's worth a look.
00:02:26
Speaker
It just struck me based off of the timing, because I had been gearing up for another field season of Lagash, and part of that is writing NSF grants, and part of that is having a digital asset plan. What's the actual digital management plan? There we go, DMP. And so what we do with our data, how it gets out, how it gets stored, has all been a consideration.

Challenges in Data Organization and Storage

00:02:51
Speaker
We actually, because we didn't
00:02:53
Speaker
have last season, we've taken the opportunity to really think through things in advance. Last week, in fact, I was down in Philadelphia because our field director and our ceramicist both flew in from Italy. So the project director got the entire team together, or most of the entire team, so that we could actually hash out some of these details in advance. And I'd already started working on this with
00:03:18
Speaker
the field director about how we're going to get our data uploaded to a cloud repository as we collect it in the field and how we're going to save that. We've got internet problems, blah, blah, blah. It's been bouncing around in my head. Then this article came out at a very opportune time. I just wanted to talk about it with you. Since I came back, I've been doing some local work. I was doing some STPs with a guy that I work with every now and then here.
00:03:48
Speaker
And we've been contacted by a vendor selling a database solution. And one of the questions that the salesperson asked was like, what happens to your data?
00:03:59
Speaker
And my colleague said that in his opinion, and this isn't that he's destroying data or destroying artifacts or anything, but if there isn't a system in place for handling that, he thinks that we have to basically consider that if it's not in the report, it's lost. And I thought, wow, that's a bit of a wake-up call. But I also don't know what the answer would be if you don't actually have the systems for that.
00:04:25
Speaker
There I go, off on a five-minute tangent about why this struck me. But it's a topic, like you said, we've touched on before, and it's a topic that comes up periodically. And I think it's a topic that we as archaeologists, if we're trying to do a good job with the data that we collect, have to revisit periodically. And we're probably going to have to do this forever and ever and ever. And if we haven't started doing it, I think right now is a really good time to start.
00:04:50
Speaker
Yeah. It's definitely a good time to start. I mean, it's 2024 and we just crossed the first quarter. So now is definitely the time, if you're not considering your data plans and things like that. And again, everybody listening to this podcast knows this, unless you're a brand new listener, in which case, hey, thanks for finding the show. This isn't a brand new show. This is episode 213.
00:05:13
Speaker
Even though we haven't had an episode really in a while. But you know, it's odd timing for this as well because, like I said, everybody knows I've worked with and consult with Wildnote, which is an application you used on some of my projects. And just this spring they started incorporating GIS, basically actual mapping, within the program.
00:05:34
Speaker
So not only are you collecting more and more digital data, more born-digital data with the application, but now they're throwing in mapping and shapefiles and this Esri integration and all this kind of stuff, right in with the software. And everybody's doing that. You know what I mean? Everybody's doing more and more digital, born-digital data collection. And even if it's not born digital, it gets converted to digital at some point because that's just how things get stored.
00:06:00
Speaker
hand it off. I mean, even the archaic Nevada BLM expects things on a CD at the very least still. So, you know, that's data that's not searchable anywhere, by the way. And I think we'll talk about that probably on the show. I definitely took some notes around, you know, data formats and things like that. But yeah, it's crazy that projects are still out there and exist where people are like, oh, hey, maybe we should think about what we're going to do with all this data when it's all said and done. It's almost more important than the actual project itself.
00:06:30
Speaker
Because if you don't have a place to store the data, you may as well not even have done the project, right? Because who's going to find out about it? Who cares, right? Where's the data? Right. That's where we've been fortunate in the misfortune of not being able to go to Lagash last season is that it gave us this space to actually sit down and do this properly. And so the first thing that we attacked was
00:06:49
Speaker
We have a cloud storage and it's organized by season. As we're looking at it, we thought, you know what? That is just not the way that we need to make our data accessible to ourselves. This is not for broader accessibility. This is just for managing our own file-based data in the field and when we get back home.
00:07:10
Speaker
And so

Improving Data Management Strategies

00:07:11
Speaker
we totally inverted it and we split things out into kind of mapping data and photographs and the field records and administrative records like permits and such. And then within those down into the trenches, down into each season because a trench might be open for multiple seasons. But it gave us the opportunity to totally rethink how we're just
00:07:35
Speaker
basically organizing. And fundamentally, I think that that's the biggest problem, not problem, biggest thing to tackle, biggest issue that you have to deal with with your data is how do you organize it so that, you know, you can find it again, right? I mean, how many times have you worked with somebody that's like, ah, I don't know where it is. And so they do a search on their computer to find something. Oh, no, it was on the other computer. Yeah. Like what, what drive is it on? Is it on the S drive or the Y drive?
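Editor's sketch: the inverted layout Paul describes (category first, then trench, then season) can be illustrated as a small path-building helper in Python. This is only an illustration under stated assumptions: the category names, the `trench_`/`season_` prefixes, and the function `archive_path` are hypothetical and are not the Lagash project's actual naming scheme.

```python
from pathlib import PurePosixPath
from typing import Optional

# Hypothetical top-level categories, loosely following the four named in the episode.
CATEGORIES = ("mapping", "photographs", "field_records", "administrative")


def archive_path(category: str, season: int, filename: str,
                 trench: Optional[str] = None) -> PurePosixPath:
    """Build category/trench/season/filename.

    The trench level is skipped for records, such as permits,
    that do not belong to any one trench.
    """
    if category not in CATEGORIES:
        raise ValueError(f"unknown category: {category!r}")
    parts = [category]
    if trench is not None:
        parts.append(f"trench_{trench}")
    parts.append(f"season_{season}")
    parts.append(filename)
    return PurePosixPath(*parts)


# A trench photo and a permit from the same (made-up) season.
photo = archive_path("photographs", 2022, "IMG_0001.jpg", trench="3")
permit = archive_path("administrative", 2022, "permit.pdf")
```

Generating every path through one function like this keeps the hierarchy consistent, which is the property that makes files findable again years later.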
00:08:03
Speaker
Yeah. Yeah. Yeah. And you know what's crazy about that is that's, that's a huge problem if you can't even find your own data and understand where to find it. Right. Because it might be, it might be obvious or, or logical in the beginning when you're thinking about a project like the Lagash project to say, okay, well, this is all season one's data. Maybe we're not completely done with this. Let's make a season two data file now and let's, let's figure out what we're going to do with that. And that just kind of,
00:08:29
Speaker
Without thinking 10 years into the future, and what does that look like? Are we really going to look at files and try to find something from season one by knowing it was in season one? You just can't really think that far ahead unless you're really actively thinking about it. Now that you guys are, you're going to be obviously way better off going into the future coming up with a good plan like this.
00:08:51
Speaker
It's such a big deal, and like you said, not even be able to find your own stuff on different projects, let alone somebody else coming into your dataset trying to find something, because that should always be the goal, is somebody else eventually coming into the dataset and doing further research and including these data in other metadata studies and things like that.
00:09:13
Speaker
It's just not something that is probably done that often in archaeology because of the lack of availability and searchability of, you know, everybody else's datasets. And that's one of the first things I pulled out of this issue. And in fact, I linked or at least I dropped the citation of the first article in this issue, which was actually a really good just overview. So if you're wondering,
00:09:37
Speaker
maybe which article to read and the titles aren't grabbing you. Read the first one. It's short. It's like four pages and it really is a good summary of everything else. And if you want to know more, you can read the other articles. But one of the biggest problems they mentioned right off the bat is not only different ways to store your data, different
00:09:56
Speaker
methods and different ways to organize it, but just the different sources of data that are out there, all stored in different ways, by the way. So you've got university repositories, government, CRM. You've got BLM, which is included in government, Forest Service, all these different places where these data are stored. And I did a lot of work in California, some of it for some private, I wouldn't say private, but
00:10:19
Speaker
CRM firms that are now like one person, you know, as somebody starts to retire and maybe they just want to keep doing some work, but they no longer work for a firm. And this one guy, he used to have a huge company back in, like, the nineties, and he had a lot. I mean, they had many, many employees, lots of projects, lots of things they did, but over time the company just kind of dwindled. People went off to other places. I think it ended up actually getting sold to somebody, and he just kept doing stuff. But his office
00:10:45
Speaker
which was one end of his house, had probably 15 file cabinets worth of archeological site data and dating back probably 35, 40 years. And that's the only location for that data. And I don't know what's going to happen when he dies. It's, and that's not unique in California. There are lots of people in California that had many file cabinets full of archeological site data. And that's literally the only place that exists except for maybe the report that is, you know, in some office somewhere at a, at a,
00:11:12
Speaker
distribution center. It's interesting that you mentioned that because that's the third thing that I've been directly exposed to with old data is that I volunteer, I need to again, I haven't been there in a while, at the Institute for American Indian Studies in Washington, Connecticut. They've got a great storage system. They're very conscientious. They've got excellent people that are working there and volunteering there, but they also have
00:11:36
Speaker
tons of artifacts and records from excavations that took place in Connecticut in the 70s and 80s. And so I've helped them when I go there with some of the organization of that, and I don't take any credit for it other than just pushing buttons and opening boxes and writing down what's in them. But that's work that needs to be done, and it was never done.
00:11:58
Speaker
And it's in excellent storage, really surprisingly advanced storage, considering that it's a small museum. But I can only imagine, having seen what's there, how many hundreds, thousands, maybe, places across the country are in a similar state, where they have just, for whatever reason, acquired, been donated, given, sponsored old projects, and now have a big mess of data to deal with.
00:12:28
Speaker
Yeah. And to me, the biggest problem is we just simply don't have a standardized data format for all of your data, a standardized format to put everything in and organize everything by. And some organizations are trying to do that, like Open Context and DINAA and, to some extent, tDAR.
00:12:51
Speaker
Yeah, tDAR, tDAR for sure. You know, they have specific formats, but just the fact that those three exist and they're completely different. I mean, I know they talk to each other, but this is archaeological data. It's kind of the same no matter where you're at, right? It doesn't matter if you're storing data from Lagash or storing data from a historic mining site in Nevada. You've got the same elements to the data across the board and you need to come up with consistent terminology and things like that. And I understand there's a huge challenge to that, but it's crazy. You might get these collections in and
00:13:21
Speaker
who knows where they came from. They could have been donated, like you said, or something like that. And there's no way to take the data and just even put it into an accessible format that other people would just inherently understand because that's how it's all done. Because there is in a way that that's how it's all done. Why

The Role of Systems Thinking and Ethics in Data Management

00:13:38
Speaker
don't we take that thought and come back after the break because I have a little bit of experience with that and want your comments.
00:13:48
Speaker
Welcome back to the ArchaeoTech Podcast, episode 213. And we're talking about an issue of Advances in Archaeological Practice, which you can find a link to in the show notes. So check that out if you haven't done so already. And yeah, Paul, I think you had some thoughts on what we were talking about at the end of the last segment.
00:14:05
Speaker
Yeah, so one of the other threads of the things that we're thinking about, and we're talking about organization, how do you organize your data so that you can find it again, and how do you organize it so other people can find it, is that having this digital management plan is part of what we have to explain what we're doing. We have to have it written out for that NSF grant, but we're also exploring, you know, there's data archiving on the one hand, which is what I was talking about with the cloud storage, but then there's also the archiving.
00:14:35
Speaker
and making it accessible for more people in other ways. So we've been involved with Open Context and discussing with them how we will host our data there. So what we have is this really complex web: we've got file-based data, we've got our own small databases for recording things, the ceramicist's database, the faunal analyst's database. I make databases because I'm an idiot.
00:14:59
Speaker
I make databases to handle all sorts of data that I deal with, survey data. These are just temporary sorts of things. These are conduits, but then they go into a bigger one that's hosted at Penn that is supposed to be publicly accessible, and that's more of a long-term storage. But then we also, like I said, with Open Context, have to figure out how to translate our data using their standards so that we can get
00:15:25
Speaker
select pieces of our data hosted by them and accessible to the wider audience. It's this really complicated, complex web of different kinds of data and how they get stored and what gets stored where and why, how it gets recorded. Again, back to the first of these two articles,
00:15:44
Speaker
its title is "A Systems-Thinking Model of Data Management and Use in US Archaeology" by Elizabeth Bollwerk, Neha Gupta, and Jolene Smith. And just seeing that the systems thinking did
00:15:59
Speaker
tickled me because as I mentioned before, before I dropped IT to come back into full-time field archaeology, my job at the school was systems manager. I was the one that knew how if you change the username format here, it's going to break some other system over there. All these little things that you wouldn't know until you've poked and prodded.
00:16:25
Speaker
seeing this forefronted, that you have to think of it as a system of interrelated data management techniques really is appealing to me. It's just innately appealing because it's what I've had to deal with for a couple decades. It isn't an answer. It's not like, oh, all you have to do is sprinkle the systems thinking thought onto how you're dealing with your data. But if you know that you have to think about it as a system,
00:16:51
Speaker
What happens when we change these names? If I collect it in this format or save it in that format, who else out there in the world can use it? Do I have to then translate it to another format that somebody else can use? Do I have to add metadata to it? Do I have to add, and here's a term that I hadn't heard before, but I love it, paradata. That was brought up and that's basically documentation of the processes of data collection. So it's not just like,
00:17:16
Speaker
a photograph might have metadata that tell you things about the camera, the lens, the exposure, the geolocation, the date, time, and so on and so forth. Paradata would be one step above that, I guess, explaining why you're using that camera, why you're taking photographs, what your decision was, your decision process for what needed to be photographed.
00:17:37
Speaker
So it's a higher level thing and they suggested that that also has to be documented in order to make the data, even with good metadata, to make the data more usable. People have to understand that data don't just happen by magic. You don't just press the record button and now you've got your data.
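Editor's sketch: to make the metadata/paradata distinction concrete, here is a minimal Python record that keeps the two side by side for one photograph. The class name `PhotoRecord`, every field name, and all values are hypothetical illustrations, not a published schema and not anything taken from the article.

```python
import json
from dataclasses import dataclass, asdict, field


@dataclass
class PhotoRecord:
    """One photograph with both kinds of documentation attached."""
    filename: str
    # Metadata: facts about the file and how it was captured.
    metadata: dict = field(default_factory=dict)
    # Paradata: the decisions and process behind the capture.
    paradata: dict = field(default_factory=dict)

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2, sort_keys=True)


# All values below are made up for illustration.
record = PhotoRecord(
    filename="IMG_0001.jpg",
    metadata={
        "camera": "example DSLR",
        "lens_mm": 35,
        "exposure": "1/250 s, f/8",
        "taken": "2022-10-14T09:30:00",
    },
    paradata={
        "purpose": "document trench profile before backfilling",
        "why_this_camera": "only body on site with a calibrated lens",
        "selection_rule": "every profile is photographed at end of day",
    },
)

sidecar_text = record.to_json()
```

Writing `sidecar_text` to a file next to the image is one way to make sure the "why" travels with the data rather than living only in someone's memory.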
00:17:55
Speaker
Even if you are just pressing the record button, there have been a lot of decisions that went into that before you hit that record button. One of the other things I've enjoyed is another site called protocols.io. It's mostly geared towards people in laboratory sciences for publishing their own protocols (hence the name) for different lab techniques.
00:18:15
Speaker
But I think I've got a couple articles up there now for GIS work that I've done. And I think that that's something that people ought to look into for documenting how and why we get our data the way that we do. So anyhow, this whole systems thinking approach was really interesting to me. The problem is at once really complex. And once you start to see the threads of how things tie together, it becomes much simpler, if that makes any sense.
00:18:45
Speaker
But that seems to be the hardest thing that archaeologists and people doing this data collection planning, that's the hardest thing they have to do is really looking outside of their project from a data standpoint, right? I mean, I would say a lot of people try to look at what they're doing and the project they're doing in the grander scheme of things from an archaeological standpoint, right? That's kind of like one of those what does it all mean things when you're writing up the report, right? It's like how does this fit in context of everything else around here?
00:19:13
Speaker
but that's where it stops, right? Nobody thinks about it. They use their data to come to that conclusion, but they don't think about how their data fits in the wider context of things and what their data is going to be used for. And it's almost like, I don't know, I almost see it as, you know, academic projects where, especially working with, um, with Codifi and then with Wildnote, talking to graduate students and, you know, how many graduate students we talked to that were like really proud of the FileMaker database they put together for their project. And I'm like,
00:19:40
Speaker
Do you really think you had to invent this for that project? Is that one of the things that you really spend a lot of time doing? I mean, great job. It looks nice. But why did you reinvent the wheel? Why did you do this? Why didn't you find some other standard that you were going to try to fit your data into, so when you're done with it, it actually goes somewhere, rather than creating this whole database just for you, just for your own thing that nobody else understands and that organizes things in a different way?
00:20:08
Speaker
It's like we're trained to think of the big picture from an analysis standpoint, but not from a longevity of the data standpoint. I always think back to that one example that is just always in my mind from Chaco Canyon.
00:20:27
Speaker
with the charcoal, and the guys, some of the first Europeans that saw that, were just hanging out in Pueblo Bonito and they said, hey, you know, we're going to spend the night here. We need to make a fire. And so they made a fire where there was already a fire in the room and completely destroyed any possibility of carbon-14 dating
00:20:47
Speaker
by, you know, completely contaminating the entire thing. And nobody thought later on that you'd even be able to do anything with that. Right. I mean, of course it was the 1920s or 15s or something like that, whatever it was, maybe even earlier. And so how could they have known, but that's the whole point. How could you have known, you know, and it's,
00:21:05
Speaker
It's just not enough thought is put into that when people

Addressing Data Accuracy and Inherited Issues

00:21:10
Speaker
are doing these things. Yeah, which again is why I'm happy that we've had this moment, this forced break to think these things because for the data collection, sometimes it does make sense to have that little bespoke database
00:21:24
Speaker
But that can't be the end product. The one that makes it very easy for the ceramicist to collect her data quickly and easily and with exactly what she needs for her analyses is great. But it has to, and so we have, again, two steps up beyond that, one being the database that's hosted at Penn, and then the other one being Open Context. So it's going to flow up through that. But we have to think through
00:21:51
Speaker
how we're going to do that. Otherwise, it's always going to be a challenge, but it becomes a real task to try to figure out how you're going to shoehorn it. But if you can plan it from the start, how it's going to flow, it becomes easier. And that's what I'm saying about the threads in the systems thinking. If you look at how things relate, it can sometimes make things easier.
00:22:14
Speaker
Right. Then rather than getting, I mean, there was another in one of the other articles, there was a term that got thrown out called inherited digital messes. Oh, I know what that is.
00:22:26
Speaker
Oh, yes, I've seen those before. And I guess what we want to do, I mean, you're not going to be able to avoid that because there are digital messes out there that are not your fault, not anybody else's fault. They just exist because they were done 20, 30 years ago and that's the way it is too bad. But if you have the opportunity to not create new digital messes going forward, then that seems to be where we have to be striving to get to.
00:22:53
Speaker
What do you think is the biggest problem with trying to figure out what to do with the dataset before you even collect it? Not even possibly knowing much about what you're about to collect, right? Especially from a CRM standpoint, you're just going to an area. Sure, you can do the research and have an idea of what you might find there.
00:23:11
Speaker
With you having worked in such wildly different areas as compared to where I've worked, Saudi Arabia, other parts of the Middle East, and then various places in the United States and other places in the world, do you think there's a possibility that you could just
00:23:30
Speaker
create a dataset, create a way to contain all the data from a project without actually even caring where you're going or what you're doing and saying, I can fit whatever I find here into this container, so to speak. Is that even something we should be striving for or is that too restrictive?
00:23:49
Speaker
I waffle. I initially wanted to make doing something like that a centerpiece of my dissertation research. And then after a while I said, nope, can't be done. It's too big of a problem. And this is a long time ago now. I mean, we have better tools now. And it's more possible now than by one grad student who is also trying to struggle with his own data and understand things and read things in different languages he doesn't really know and so on and so forth.

Debate on Universal Data Containers and Standardization

00:24:17
Speaker
Oh, and raise kids at the same time. It wasn't going to happen. It needs to be tackled by people that really know what they're doing. And data science has certainly undergone a revolution in the last 10, 15 years. And so there are people that are definitely better poised to handle that and answer that in the affirmative than I was when I looked at it most closely.
00:24:41
Speaker
Where was I going with this? Is it possible? Yes, but the problem again, I think is back at the start is that if you have all possibilities open, you don't know what to collect and what not to collect. You have to be able to put things in a particular bucket, put things in certain fields, put things so that they can go through.
00:25:06
Speaker
you can't present every option right at the start. And if you do, so the school information system that we used when I started in 2000 had been used for a few years and it was very open-ended and very usable by a bunch of different schools. But a lot of the administrators who were using it didn't know really the ins and outs of database practices. So one of the first things that we did
00:25:30
Speaker
when the new director of IT came in, he went and started cleaning up the data. One of the things that people have been doing for a number of years is putting asterisks and dollar signs or things on people's names to indicate if they were a new hire. Because they didn't seem to know that there was someplace else in the record, the hire date.
00:25:56
Speaker
And that you could use that instead to filter out. But then they had to go through Crystal Reports in order to get that out. It was much easier for them just to get a list of the names and then manually pull out everybody that had an asterisk. And so that gets at training, that gets at data use, that gets at the design and implementation of the data entry. These are all interrelated. Again, systems thinking. These are all interrelated problems that
00:26:24
Speaker
if not done right, make it really hard to collect the proper data in the field. So is that higher level goal attainable? Probably, but it has to be thought from the ground up. Right. Well, I have another question for you. We'll do that on the other side of the break when we wrap up this topic back in a minute.
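Editor's sketch: the asterisk anecdote contrasts two ways of flagging new hires: a marker stuffed into the name field versus a filter on the dedicated hire-date field. A minimal Python illustration follows; the staff records, the field names, and the helper functions are all made up and do not reflect the real school information system.

```python
from datetime import date

# Made-up records showing the workaround: markers typed into the name field.
staff = [
    {"name": "A. Example*", "hire_date": date(2000, 9, 1)},
    {"name": "B. Sample", "hire_date": date(1995, 9, 1)},
    {"name": "$C. Placeholder", "hire_date": date(2000, 8, 15)},
]


def clean_name(name: str) -> str:
    """Strip the ad hoc '*' and '$' markers (and stray spaces) from a name."""
    return name.strip("*$ ")


def hired_since(records, cutoff):
    """Select new hires using the hire-date field, not the name markers."""
    return [clean_name(r["name"]) for r in records if r["hire_date"] >= cutoff]


new_hires = hired_since(staff, date(2000, 7, 1))
```

The filter gives the same answer whether or not anyone remembered to type an asterisk, which is the point of keeping each fact in the field designed for it.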
00:26:50
Speaker
Welcome back to the ArchaeoTech Podcast, episode 213. We're talking about the data curation crisis, so to speak. And so leading off the question I asked you at the end of the last segment, Paul, about, you know, can there be a, essentially a universal container
00:27:07
Speaker
for data. And one of the things, just kind of like leading up to that, that I can bring up is there's been a number of roads converging on the same spot for me in my life as an archaeologist. And the first one was getting into digital data collection. Well, I guess if you go back even further, it was my first few years in archaeology doing CRM. I've worked in a total of about 18 or 19 different states, and every single one of them has a different site form. Now out here in the West, it's a little bit,
00:27:37
Speaker
It's a little bit more similar because a lot of the Intermountain region ones were based on the same form, but they've kind of diverged from there, but they kind of all collect the same information still, just in different fields, which leads me to what I'm about to say here. But everywhere you go, there's a different way to record sites. Now, some of them have to have that because you're not actually recording sites out in the field in some cases, for example, shovel testing. You're really defining sites mostly when you get back into the office and you're trying to figure out what you have and
00:28:04
Speaker
you know, things like that. Your shovel test might initially define a site boundary, but after that,
00:28:09
Speaker
You're not really filling out a site form, so to speak, in the field like you do out here in the West a lot. But that being said, you're still recording the same things. You're still recording all the same stuff. And when I was looking at producing an application, and I've gone down this road in four different ways, in every single case, my latest one being with Wildnote a few years ago, I was trying to create a universal site form. I went to the point where I had a whole spreadsheet of
00:28:39
Speaker
basically trying to have similar terms on one axis, like the actual fields on one axis with my universal site form fields coming across the other axis and trying to figure out, are there any truly unique fields on these site forms that can't be dropped into a universal container? And in most cases, the answer was no. And in most cases, the
00:29:01
Speaker
The reason why the site forms were so bespoke is because they just got into incredible detail on the actual site form. Whereas

AI's Impact on Data Collection and Interpretation

00:29:08
Speaker
some were a little more generic. Is this a feature? Is it prehistoric or historic? This is all they cared about as long as there was a description. But you get to Utah and Nevada and they want a massive amount of detail, at least on their older site forms they did. The newer ones are a little more succinct.
00:29:23
Speaker
which ought to tell you something to begin with. But it was all leading towards can we make a universal site form that would then start leading us towards this universal data collection and storage method. Because I feel like we can't even talk about a universal way to collect and store data without having a universal way to collect data. You know what I mean?
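The spreadsheet crosswalk described above, with each form's fields on one axis and the universal fields on the other, can be sketched roughly as follows. This is only an illustration, not the spreadsheet from the episode: the form names, field names, and the `to_universal` helper are all hypothetical.

```python
# Hypothetical crosswalk: for each (made-up) state form, map its field
# names onto the field names of an assumed "universal" site form.
CROSSWALK = {
    "state_a_form": {
        "Site Type": "site_type",
        "Cultural Affiliation": "period",
        "Narrative": "description",
    },
    "state_b_form": {
        "Feature?": "site_type",
        "Prehistoric/Historic": "period",
        "Description": "description",
    },
}


def to_universal(form_name, record):
    """Split a record into universal fields and fields with no mapping.

    The second dictionary answers the question raised in the discussion:
    are there truly unique fields that cannot be dropped into the
    universal container?
    """
    mapping = CROSSWALK[form_name]
    universal = {}
    unmapped = {}
    for field, value in record.items():
        if field in mapping:
            universal[mapping[field]] = value
        else:
            unmapped[field] = value
    return universal, unmapped


universal, unmapped = to_universal(
    "state_a_form",
    {"Site Type": "lithic scatter", "Narrative": "Sparse flakes.", "Slope": "5%"},
)
print(universal)
print(unmapped)
```

Anything that lands in the second dictionary is a candidate for a truly unique field; if that dictionary is usually empty, the forms differ mostly in naming and level of detail.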
00:29:48
Speaker
It's another thing, talking about WildNote and doing GIS and stuff like that, and understanding things like data dictionaries. I mean, every company has its own data dictionary for their Trimble usually, right? And it's like, why do we all have different ones? Why isn't it all the same one that is mandated by the agency that you're working with or something like that? And some may, but in general,
00:30:09
Speaker
a lot of companies just have their own. Doesn't anybody see a problem with that? That's a different question, but do you think that we could do something like a universal site form or would that be too restrictive, similar to what I asked you in the last segment? Yeah, I think that my answer is going to be the same as what I was saying in the last segment. Either it's going to be too restrictive and you have to then force things into fields that they don't belong, like those asterisks on people's names. That's not because that database was restrictive, but because
00:30:38
Speaker
the administrators didn't know of and didn't see that date hired field. So in effect, it was restrictive. So you either end up with a situation like that, and then people are going to be shoehorning data into places they don't fit in order to have it recorded, or else, like I was saying, you make it so it's all wide open, and then you don't see the fields that you need.
00:30:59
Speaker
I've dealt with systems like that where I have to scroll through page after page to find the bit that I actually am recording on this site because all the other possibilities were listed. Neither is perfect. That's where I'm falling now with the bespoke not being terrible, but maybe there is some intermediate there where it's
00:31:29
Speaker
you know, whatever federal agency is mandating it says it needs these things recorded in this way. And whether you're doing a historic site, a prehistoric site, you're doing a survey for who knows what, you're doing shovel tests versus you're doing one-by-ones or, you know, so on, you select and say, okay, we're going to need data x, y and z for this one and a, b and c for that one. Oh, and they're both going to need q, you know, so they're
00:31:56
Speaker
different forms presented to the people actually collecting the data that strip out the extra garbage but present what's absolutely necessary in the most necessary way possible and most visible way possible. That's a good feeling at this point.
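The intermediate approach described here, where one shared set of fields is filtered into a different form per kind of work, could be sketched like this. It is a minimal illustration under stated assumptions: the field names x, y, z, a, b, c and q are the speaker's own placeholders, and the profile names and helper functions are hypothetical.

```python
# Fields every kind of recording needs ("they're both going to need q").
COMMON_FIELDS = ["q"]

# Hypothetical profiles: which extra fields each kind of work requires.
PROFILES = {
    "shovel_test": ["x", "y", "z"],
    "one_by_one": ["a", "b", "c"],
}


def fields_for(kind):
    """Return only the fields that should be shown for this kind of work."""
    return PROFILES[kind] + COMMON_FIELDS


def missing_fields(kind, record):
    """List the required fields that are absent or left blank in a record."""
    return [f for f in fields_for(kind) if record.get(f) in (None, "")]


print(fields_for("shovel_test"))
print(missing_fields("one_by_one", {"a": "done", "b": "", "q": "done"}))
```

The person in the field sees only the short list from `fields_for`, while the underlying store can still hold every field any agency asks for.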
00:32:13
Speaker
Well, I mean, there is, I feel like there does have to be a, and I think people inherently know this, but there's an acknowledgement between collecting data and analyzing and interpreting data, right? There's definitely a big difference there. And it's leading me to think about another piece of software that I work with, kind of my new day job, so to speak, and it has nothing to do with archeology, but we've got a new feature. Let me just,
00:32:39
Speaker
Let me just give you an example here. We build modules basically that handle a concept. So let's say you've got one module that's doing one thing. In general, it's about the same for everybody. It's like recording a site. It's about the same for everybody with some slight differences, right? Well, there's this new AI summary feature where it will basically take
00:32:58
Speaker
all of the forms that were put together for this one record, we call it, and there could be any number of subforms and pieces of data and different ways different companies have rephrased things and done stuff to just fit their own needs, right? Very similar to what we're talking about. But the AI summary just looks from beginning to end and summarizes this entire record in human readable terms. And it'll mention, for example, first names in the first paragraph and maybe a few paragraphs down, it'll just go down to first name. It gets very conversational.
00:33:27
Speaker
very familiar. And it just puts it in this really easy to understand, shockingly good summary of the entire record down into about four or five different paragraphs. And I'm kind of wondering, man, if we just had
00:33:42
Speaker
a universal way to collect everything in a way that didn't limit what we wanted to say about it. It still allowed description, still allowed things like that. And then instead of doing, I don't know, instead of doing some sort of overly academic, complicated analysis, just see what AI can do with it
00:34:01
Speaker
and just say, hey, have a summary button. I'm half tempted to make an archaeology module out of this software because that's what I do is I configure it. And just to see how it would interpret certain archaeological sites if data were put in there, right? Just to see what it would say. It wouldn't be able to interpret in context of maybe everything else that's been collected out there because most of that is behind some kind of wall that, you know, you can use ChatGPT all day long, but it doesn't know anything when it can't get to the actual data, right? So, you know, it's only as good as what's in there. But I just wonder if
00:34:31
Speaker
our future lies somewhere down that road, right? Where we really are archeology data scientists and we're just collecting data to put into this database that will be analyzed, interpreted from a much bigger standpoint in a way that we never could. Possibly. I mean, I'm actually thinking about the other way around. I wonder if AI could be used to help streamline that process so that you're only collecting the necessary data and you can do it in the easiest way possible. And now the backstory to this is that my dad,
00:35:01
Speaker
was a physician, and in the 1990s he was working with a large pharmaceutical company to develop what we called back then an expert system in order to collect data on diabetes patients and to monitor them
00:35:16
Speaker
with their meds, their weight, a lot of different factors around their health. And so that database, that expert system could be distributed to rural doctors, to general practitioners, to people who weren't endocrinologists, who weren't specialists in diabetes care in order to make sure that things didn't fall through the cracks. And so that's kind of the same sort of thing that I'm thinking about now with archaeology. Maybe you could use AI to generate those,
00:35:43
Speaker
I'll still call it an expert system, though I'm sure there's a different term of art now, in order to help streamline and make more efficient and make more comfortable.

Ethical Obligations and Data Sharing Principles

00:35:54
Speaker
I actually hate recording data in the field. It always stresses me out. Even the simple stuff like Munsell stresses me out to no end because I never see the same color as any of the damn chips.
00:36:06
Speaker
But if I could have some way of getting the right prompts and being prodded along and making the decision, it's 10YR 4/3, not 4/4. That would definitely help me.
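As a very small illustration of the kind of prodding described here, a recording app could check that a Munsell reading is well formed and list the adjacent chips to compare against before the value is saved. This is a sketch only: the `parse_munsell` and `neighbors` helpers are hypothetical, and the code does not know which chips actually exist on any printed chart.

```python
import re

# Hue number, hue letters, then value/chroma, e.g. "10YR 4/3".
# Whole-number value and chroma are assumed for simplicity.
MUNSELL = re.compile(
    r"^\s*(\d+(?:\.\d+)?)\s*(YR|GY|BG|PB|RP|R|Y|G|B|P)\s+(\d+)\s*/\s*(\d+)\s*$",
    re.IGNORECASE,
)


def parse_munsell(text):
    """Return (hue, value, chroma) for a reading such as '10YR 4/3'."""
    match = MUNSELL.match(text)
    if match is None:
        raise ValueError(f"Not a recognizable Munsell reading: {text!r}")
    number, letters, value, chroma = match.groups()
    return f"{number}{letters.upper()}", int(value), int(chroma)


def neighbors(hue, value, chroma):
    """List readings one step away in value or chroma on the same hue."""
    candidates = [
        (value - 1, chroma),
        (value + 1, chroma),
        (value, chroma - 1),
        (value, chroma + 1),
    ]
    return [f"{hue} {v}/{c}" for v, c in candidates if v >= 0 and c >= 0]


hue, value, chroma = parse_munsell("10yr 4/3")
print(f"Recorded {hue} {value}/{chroma}. Compare against:", neighbors(hue, value, chroma))
```

A prompt built from `neighbors` would put 10YR 4/4 next to 10YR 4/3 and ask the recorder to confirm which one they mean.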
00:36:22
Speaker
So maybe we do it at both ends. I'm sure that's going to be happening. I'm sure that what you're talking about for the analysis part is starting to happen already. Or at least a synthesis, if not full-on analysis, at least turning it into something that somebody else can read and understand the basics of, like reading the summary or the abstract of an article.
00:36:45
Speaker
That could be useful and then having it at the start end for helping collect the data. That's an interesting, I don't know where we're going to go with that, but I'm sure that that's going to be worked on. Yeah, I have no doubt.
00:36:58
Speaker
So another thing before we round out that I just want to bring up here is something that runs through a lot of these articles is that we have an ethical obligation to share collections and data beyond just our profession. And that gets to your question here about the AI. And
00:37:15
Speaker
One of the articles starts out pointing out the obvious. It's Anthro or Archaeology 101 is that what separates archaeologists from looters and tomb robbers and so on, just pot hunters in general, is that
00:37:31
Speaker
we record things. And because we record things, and we record things in as much detail as possible, we've kind of dug our own hole, you know, in a good way. I mean, I'm glad that we're recording so much, but I'd rather have it that way than the other way around, but we absolutely have that ethical responsibility to think through what the rest of that means. And then another related thread that gets brought up in one or both of these is the FAIR and CARE principles. Let me quickly
00:38:01
Speaker
pull open a webpage. FAIR is findable, accessible, interoperable, and reusable. CARE is collective benefit, authority to control, responsibility, and ethics. I'm glad that those have worked their way into a lot of our discussions.
00:38:17
Speaker
In fact, it was one of the things we had to say in the NSF proposal was, hey, how is our data management plan going to address these? And so it's becoming foregrounded in a way that it wasn't in the past. And I think that's only a good thing for our field. How do we deal with this deluge?
00:38:34
Speaker
You know, unfortunately, I don't have good answers. And at least in none of the articles I read do I recall there being good answers. There are a lot of good, like, proposals, things that we should be doing, things that we really should be doing better, but not a, like, one-size-fits-all, hey, you know, if you just do X, you will suddenly have good, reusable, well-curated data. It's much more complex than that.
00:39:01
Speaker
Do you have any ideas, Chris, aside from holding out the hope that AI fixes it for us of how we might start addressing these things? I'm not sure there is an answer, to be honest with you, only because we have so many different places. Going back to one of the very first things I said and that I actually read in this issue was there's so many places to actually store the data, so many agencies that are mandating the collection in a very certain way.
00:39:30
Speaker
for all this to happen, I mean, we would need almost a federal level understanding of, Hey, you're going to collect and store things this way. And then even that breaks down when you talk about, you know, private projects in say California and places like that. It's just, it's just almost no way to do it. So I think rather than having
00:39:51
Speaker
AI summarize like a site record or something like that, like I was mentioning, I think AI would probably come into play in actually just analyzing these datasets. So you can query the dataset with an AI rather than say with an actual database,

Future Directions in Archaeological Data Management

00:40:07
Speaker
right? And you just say, hey,
00:40:08
Speaker
is there anything, and you say it in human readable terms, was there anything found that's similar to what I've got over here? I mean, you literally should be able to just say that, right? Because it can understand what you've got and understand what's in the other thing. And then just, like, let you know if you need to dig any deeper, right? Because I don't think we're,
00:40:24
Speaker
I don't think we're going to get to a point in a really long time where everything is collected in a way that is universally searchable and understandable by people, let alone all the past collections that we'd have to worry about. Those are gone unless something else can organize them.
00:40:41
Speaker
I think that's where we're headed. I think that's where we'll have to be headed if we want to understand large data and how well that's working. I think AI is going to solve that problem for us. Well, I think we're going to have to wrap this up. I could probably go for another hour, but I'm looking at our notes here and I've got more things still unchecked that I would like to talk about than checked.
00:41:04
Speaker
But so maybe we'll revisit this and hit some of these other ones in an upcoming episode. But there are some interesting things that I learned in reading these articles that I would like to discuss with you some more.
00:41:16
Speaker
Yeah, indeed. Me too. Maybe we can get our producer to take a look at this and see maybe you can bring a few of these people on to talk about these subjects further, because just looking at the names of the people who have written some of these articles, we've interviewed at least a handful of them. I recognize at least four or five different names on here that I know that we've interviewed, at least on this show. If not on this show, then on other shows as well. And so I think we can
00:41:41
Speaker
We can get some people on here to talk about this. Josh Wells, I was just emailing with him. Yeah. I'd always rather have experts talking about their work and their understanding than me bloviating. Yeah, me too. Yeah. If only AI would just summarize my thoughts, I can just blather on a podcast and then have something intelligent come out. There you go. There you go.
00:42:03
Speaker
All right. Well, thanks, Paul. It's good to have you back on. Thanks, Chris. Good to be back. Hopefully we can keep this going this time. All right. All right. Take care. Thanks a lot, everybody. Bye. Thank you. Bye.
00:42:18
Speaker
Thanks for listening to the ArchaeoTech Podcast. Links to items mentioned on the show are in the show notes at www.archpodnet.com/archaeotech. Contact us at chris@archaeologypodcastnetwork.com and paul@lugol.com. Support the show by becoming a member at archpodnet.com/members. The music is a song called Off Road and is licensed free from Apple. Thanks for listening.
00:42:44
Speaker
This episode was produced by Chris Webster from his RV traveling the United States, Tristan Boyle in Scotland, DigTech LLC, Cultural Media, and the Archaeology Podcast Network, and was edited by Rachel Rodin. This has been a presentation of the Archaeology Podcast Network. Visit us on the web for show notes and other podcasts at www.archpodnet.com. Contact us at chris@archaeologypodcastnetwork.com.