What’s the future of data publication and sharing? And what does this mean for traditional journals and books? This session will explore questions such as what constitutes useful data, the role of data in open research and how financial sustainability can be achieved.

Speaker abstracts:

Grace Baynes, VP, Research Data & New Product Development, Springer Nature
Abstract: ‘Are data really the new oil?’
The huge potential of open research data will only become reality when sharing data is as commonplace for researchers as the publication of research articles and monographs. We need to do more to make it easier for researchers to share their data, and more obvious why it is worth their time and effort. Based on surveys of over 11,000 researchers we’ll share researchers’ current attitudes and experience of data sharing, what motivates them and what is holding them back. Drawing on industry initiatives and Springer Nature’s experience, what actions are journals and publishers taking, what are the costs, and what role should they play: enabler, enforcer, service provider?

Mark Hahnel, Founder/CEO, figshare
Abstract: ‘Data are the new papers’
With funder mandates driving academics to make all of their research outputs openly available, what does the emerging space of academic data publishing look like? Who are the players and who is responsible for making sure data is FAIR? Most importantly, who pays?

Simon Kerridge, Director of Research Services, University of Kent
Abstract: ‘Data Data Every Where, Nor Any Byte To Eat’
With the proliferation of “open” comes the data tsunami. FAIR is easy to understand but (very, very) difficult to achieve. A lot of effort is needed by the researcher to enable data to be reusable. The importance of research data management plans should not be underestimated, but alone they are not enough: researchers must be persuaded to put in the extra effort, particularly on the metadata front – but how?

Paul Stokes, Senior Co-design Manager, Jisc
Abstract: ‘Data – cost centre or profit opportunity?’
A scene-setting provocation exploring the data conundrum. In particular: the disconnect between the cost of producing & storing data, and the benefit non-contributors gain from it; the (hidden?) value of the data and how one might go about calculating a figure for the value and utilising the value.

Download slides:
Parallel 3b – Grace Baynes
Parallel 3b – Mark Hahnel
Parallel 3b – Paul Stokes


Mr. Dan Pollock
Chief Digital Officer and Open Access Practice Director, Delta Think
Dan is the Chief Digital Officer and Open Access Practice Director at Delta Think, based in London, UK. His strength lies in helping publishing companies successfully explore and exploit new digital opportunities, through his combination of commercial expertise and technical understanding. Over the last 20 years, Dan has specialized in digital publishing strategy and product innovation for companies primarily in STM and Legal publishing. He has held executive leadership and strategic advisory roles, and senior positions managing products, operations and change in organizations ranging from dot-coms to complex corporates. His expertise includes the establishment of new business units and product management functions in large organizations, handling digital extensions to flagship brands, the implementation of analytics and social media practice and policies, business and price modeling, interfacing between business and technology functions, and strategic advisory services. Most recently, prior to joining the Delta Think team, Dan was part of the leadership team of Jordan Publishing Ltd through its successful sale to LexisNexis (RELX). He has previously worked for Macmillan Publishers (Nature Publishing Group), Outsell Inc., RELX (Elsevier Health Sciences, LexisNexis), and Times Mirror International Publishers. His early career was in software engineering, specializing in publishing solutions. Dan has been the principal architect of the Delta Think Open Access Data & Analytics Tool and supports market intelligence and strategy projects related to Open Access.


Grace Baynes
VP, Research Data & New Product Development, Springer Nature
Grace is responsible for Springer Nature’s approach to research data, including advocacy for open data and good data practice; journal data policies; and data publishing including the journal Scientific Data. Her new product development responsibilities are currently focused on developing research data services and solutions for researchers, institutions and funding organizations, and establishing our new product development approach for researcher services. Grace has spent twenty years in publishing, sixteen of those working in open research, joining open access publisher BMC in 2003, and since then in roles at Nature Publishing Group and now Springer Nature.

Mr. Mark Hahnel
CEO and Founder, figshare
Mark is the CEO and founder of figshare, which he created whilst completing his PhD in stem cell biology at Imperial College London. Figshare currently provides research data infrastructure for institutions, publishers and funders globally. He is passionate about open science and the potential it has to revolutionise the research community.

Simon Kerridge
Director of Research Services, University of Kent
Simon has been a research manager and administrator (RMA) for 25 years, the past 7 leading the research office at Kent where he is responsible for all aspects of the research support including pre-award, post-award, information, strategy, assessment and governance. He has a passion for RMA as a profession and holds many national and international affiliations with acronyms that he would be happy to explain, including: ARMA, CASRAI, CRediT, EARMA, JoRMA, NCURA, OSB, RAAAP, SRAi, and was an author of the Metric Tide report.

Paul Stokes
Senior Co-Design Manager, Jisc
Paul leads on Preservation for the Jisc Open Research Hub (JORH) with a particular interest in costing, sustainability and establishing the value and impact of research data. Paul was the project co-ordinator for the highly successful European “4C project – A collaboration to clarify the cost of curation.”

Transcript:
[00:00:00.49] [MUSIC PLAYING] [00:00:16.01] DAN POLLOCK: Thank you very much. Welcome, everybody, to this session. Does data mean dollars, or should it just mean data? I’m sure you’re aware, we’re starting slightly late. The technology deliverable from the previous session seems to have landed slightly late. Some of you might say, so what’s new?
[00:00:34.26] So we like to look at data. What I want to do is give a very brief introduction, and then we’ve got some terrific speakers lined up to examine it. So what is data? Why is it relevant? Data is really anything, any information that’s generated in the research process. I came across a definition from the Belmont Forum– “any digital research outputs that would realistically be required in order for the relevant article to be validated.”
[00:00:59.64] I thought that was an interesting definition, because it does really speak to the link between the data and the research process and the publisher, which, of course, is why we’re all here. I think, historically, data for publishers has been something you lock away in a supplementary info file and just hope it doesn’t break your production systems. But unfortunately, that is really no longer acceptable. Data really is something that needs to be managed properly in its own right. It needs to be linked properly into papers. Most of the major funders now will have some sort of policies around data management.
[00:01:33.45] So they are bearing down on their researchers to do something with this data stuff, but quite often, it’s only at the point of publication the researchers may actually think about data sharing as well as showing actually the research in the form of a paper. So I think data, in its broadest sense, is something that we publishers are increasingly going to have to engage with.
[00:01:55.17] So we’ve got speakers representing a variety of viewpoints. And I propose we’ll hold all the speakers end to end, and then if there’s time at the end, we’ll take what questions we can. We’ve got a bit of a hard stop at 20 past 12:00, because I believe this room is being used for the ALPSP meeting, so no pressure there. Speak quickly, please.
[00:02:15.80] So we’ve got representatives from the consortia side of things, Paul Stokes from Jisc; Simon Kerridge, who will then represent, if you like, the researcher perspective from the University of Kent; Grace Baynes from the publisher perspective; and last but by no means least, Mark Hahnel from the vendor perspective, if you like.
[00:02:34.95] So I will stop speaking, and without further ado, I’d like to introduce Paul Stokes to kick things off.
[00:02:40.89] [APPLAUSE] [00:02:44.29] PAUL STOKES: OK, first up, first of all, a little bit about me and where I come from. I’m from Jisc. We’re the infrastructure provider for HE and FE in the UK. We also have various products we are pushing at the moment. I am particularly involved in the research data shared service that is now called the Jisc Open Research Hub, all about publishing data, underpinning research from the point of publication.
[00:03:09.10] My current interests relate to data management, preservation, costs, and sustainability. Anyone who’s heard me before knows I keep on banging on about costs and money. And first, I decided to do a bit of scene setting. We have a lot of data. Technically, we have a shitload of data. We’re generating more data now in the past 10 years than we have in all of humanity’s history up to the 10-year point. And that statistic has held true for the past couple of decades. It’s vastly increasing.
[00:03:44.75] We are also reaching the limits of storage media. Yes, there are new technologies that allow you to cram more data into more stuff. We can, in theory, store data at the atomic level now. However, a little stat on that one– in 181 years at the current rate of increase, we will have used every single atom in the earth to store data. So it’s not sustainable.
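A claim like this is an exponential-growth extrapolation. As a minimal sketch of the arithmetic (not the speaker's actual calculation), the function below solves current × (1 + growth)^t = limit for t. The function name and all three example inputs (the rough atom count for the Earth, the starting data volume, and the growth rate) are illustrative assumptions rather than figures from the talk.

```python
import math


def years_until_limit(current_units: float, limit_units: float, annual_growth: float) -> float:
    """Years until an exponentially growing quantity reaches a fixed limit.

    Solves current_units * (1 + annual_growth) ** t == limit_units for t.
    """
    if current_units <= 0 or limit_units <= 0:
        raise ValueError("quantities must be positive")
    if annual_growth <= 0:
        raise ValueError("annual_growth must be positive for the limit to be reached")
    if current_units >= limit_units:
        return 0.0
    return math.log(limit_units / current_units) / math.log(1 + annual_growth)


if __name__ == "__main__":
    # All three inputs are assumptions chosen for illustration only.
    atoms_in_earth = 1.3e50    # commonly quoted order-of-magnitude estimate
    bits_stored_today = 1e23   # hypothetical starting point, one bit per atom
    growth_per_year = 0.40     # hypothetical 40% annual growth
    years = years_until_limit(bits_stored_today, atoms_in_earth, growth_per_year)
    print(f"Roughly {years:.0f} years until every atom is used")
```

Because the answer depends on the logarithm of the ratio, even large changes to the starting volume move the result by only a few years; the assumed growth rate matters far more.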
[00:04:08.97] So the data conundrum in a nutshell– first of all, the cost side of things. It costs money to generate data. Therefore, it leads us to believe that data has value, probably. But what is this value? How do we derive it? How do you understand it? Well, value is derived from coherence, discoverability, usability, aggregation, linking, and analysis. But it’s difficult to place an actual pound, shilling, and pence value on that data. We believe it’s got value, but we don’t really know what it is. And I’ll return to that very aspect in a minute.
[00:04:49.02] Another part of the conundrum– it costs money to keep data. It’s not a free thing. You can’t stick all of this and forget about it. Sorry if I’m galloping along here, but we’re short of time. Preservation is an active process. You have people doing it. You have devices. You have technology doing it. It costs money to do that: CPU time for format migration, storage of multiple copies, curation and management.
[00:05:12.98] And grant funding– to come back to the old problem I have in my field– often prohibits payment post-project. So you can’t, in theory, pay money after the project has ended. There are “pay once, keep forever” models, but are they sustainable? I suspect not.
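One way to reason about whether a "pay once, keep forever" model can work is to treat the yearly storage bill as a geometric series: if the cost of keeping the same data falls by a fixed fraction every year, the total over an unlimited horizon is finite. The sketch below assumes exactly that. The function names, the steady decline rate, and the sample figures are hypothetical and do not come from the talk.

```python
def cost_over_years(first_year_cost: float, annual_decline: float, years: int) -> float:
    """Total storage cost over a finite horizon with geometrically falling costs."""
    return sum(first_year_cost * (1 - annual_decline) ** year for year in range(years))


def pay_once_fee(first_year_cost: float, annual_decline: float) -> float:
    """Closed-form limit of the series above: first_year_cost / annual_decline."""
    if not 0 < annual_decline <= 1:
        raise ValueError("annual_decline must be in (0, 1] for the series to converge")
    return first_year_cost / annual_decline


if __name__ == "__main__":
    # Hypothetical figures: 100 per year today, costs falling 20% a year.
    print(cost_over_years(100.0, 0.20, 10))  # cost of the first ten years
    print(pay_once_fee(100.0, 0.20))         # one-off fee covering all future years
```

The model covers only costs that actually shrink over time. The people, curation, and format-migration effort described above may not fall at all, in which case the series does not converge and a one-off fee cannot cover it.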
[00:05:31.46] The future value of data– we’re paid to produce it. We’re told to keep it, because it might be valuable or useful in the future. But the problem is forecasting that future value is very difficult. We don’t know who is going to use it. We don’t know why they might use it. We don’t know what they’re going to use it for. And we don’t really know if there’s a credible return on an investment. The big problem for me is that those who benefit from the data we’re storing and preserving aren’t the ones who pay to generate it and store it. That’s a major disconnect in this area.
[00:06:06.15] There are risks associated with losing data. GDPR– big thing in many people’s minds at the moment. Fines, compensation, reputational damage– how do you put monetary value on that? Future litigation and statute requirements– there are major risks that people are starting to get very worried about.
[00:06:24.81] Project funding is often a model people talk about insofar as data is concerned, but it is not a sustainable business model. It runs out. It can’t be used post-project.
[00:06:35.84] Mandates– one reason we keep data is being told to keep the data. The government has told us we must keep it. The funder has told us we must keep it. But that’s not a sustainable business model. Political expediency may mean withdrawal of funding at any time for any reason. You cannot rely on that as a business model. And what happens when the money runs out? It gets switched off. We lose the data. Oh, dear. Not so good.
[00:07:03.75] So the specifics in the particular region I’m working in, HEIs– they have a major problem in terms of what it is they’re storing, where it is, what it’s doing there, what the value of it is, and what the risk is associated with it. And for me, it all comes down to the money. Sorry about the very bad pun there, Jessie J. My daughter told me this one.
[00:07:33.20] So we need a business model or business case to justify what we’re spending. In theory, the money in, the value of the data, should exceed the costs of storing and keeping the data. But I don’t know if it does, and I don’t know if we can prove it does. Major problem for me.
[00:07:53.22] So how do you think about the value of data? Well, this is based on Gartner. There are various ways you might value data. There’s the intrinsic value of the data, what it cost us to create it. There’s the business value, performance. And don’t worry about reading this. I assume slides will be circulated afterwards, so this will be there on the slides. Don’t worry. Market value– but this is all very difficult to understand and to put actual numbers on.
[00:08:19.89] So I’m working with various partners in my area to try and put an end game together where the institutions will know what they’ve got, they’ll know where it is– and believe me, that’s a problem for many institutions. They’ll know its value. They’ll realize its value. In other words, put some– put it to use. Make sure they can derive some value from it. Use appropriate systems. Get it organized.
[00:08:44.97] Remember I mentioned earlier on about having coherence, having discoverable data, having usable data. And they’ll add value through aggregation, linking, and analytics. So how do we get there?
[00:09:00.98] And one of my colleagues always told me to always have a cat in your presentations. Everyone loves a cat in presentations. So why have I got this particular one here? Well, this one is called Puzzled Kitty, because I don’t have a definitive answer about how we realize the value of data, how we value the data itself, how we find it all, how we solve this problem. I’m hoping that the discussion we’re going to have in here today and the other people here today will help us derive at least a pathway towards that answer and think about possible innovative ways of valuing that data.
[00:09:39.56] One of my colleagues in the data– in the Digital Preservation Coalition has been working on innovative approaches, such as insurance valuation. One way of realizing or finding out the value of data is to ask someone to insure it. They’ll soon tell you what they think it’s worth.
[00:09:56.42] Or tax valuation– if you have a collection in a memory institution that has been donated to the nation, the taxman has put a value on that. And something the taxman is actually working on at the moment is how you put a value on a data collection. And the short answer is– no spoilers here– they don’t know.
[00:10:15.23] So imagine if data could be treated as a tangible asset on your balance sheet, something that you could actually put against your losses, put against your profits, and so on. There are different ways we should be thinking about this. So it’s time to raise awareness, kick-start the discussion, hence this session here. Now this slide here was all about asking questions, but we’ll have questions later. So that’s my bit. Gallop through.
[00:10:40.31] [APPLAUSE] [00:10:50.35] SIMON KERRIDGE: Excellent, OK. Ooh, zoomed through. I was going to talk a little bit about a cat or dog, but I put an albatross in instead. I’ll talk a little bit about data proliferation, but not very much now. I’ll focus quite a lot on FAIR and then incentives is hopefully going to be my takeaway message. So yes, OK. Data tsunami– we can skip over these. Lots of tera, zetta, goodness knows what bytes, and that’s all been covered. So I’ll save some time. I’m feeling very happy.
[00:11:26.50] We all know about FAIR, don’t we? Yes, we do, Simon. Yeah, good. OK. Fine. Carry on. I have those letters sort of coming through my slides as a theme. So for those of you who vaguely saw on the first slide, I’m a research manager and administrator. So I’m not really a library type, although we do have some shared support with the library. So I occasionally go in there. I know we’ve got a building.
[00:11:51.81] What I look after is funding, so research funding, and, of course, data is associated with that as a scholarly output. But who knows what a research manager and administrator actually does? OK, so who cares what a research manager– OK. OK, right. OK, OK. So this could be an interesting question, then.
[00:12:12.17] So you might think to yourself, that was a really interesting presentation, or he was mumbling, and I couldn’t quite hear what he was saying. I’d like to find out what a research manager and administrator is. I’m sure there must be something on the internet that will tell me. So you type into your favorite search engine. Mine is DuckDuckGo, rather than Google, just to be different. And you type in, research manager administrator, what are they, who are they, and you find out there’s some associations. Well, that’s all very well and good. Everybody’s got an association.
[00:12:42.05] What you really want to know is, what do these people do? What kind of makeup do they have? What sort of people are they? So of course, you go to the repository of all things useful, Figshare. You can pay me later. It is wonderful, though. And you type in, who are research managers and administrators. And you get some stuff, which is actually not particularly useful, I’m afraid to say, because it’s not a well-known phrase. It’s not a thing that people know about, which is why you’re about to find out.
[00:13:10.13] So obviously, what you need to type in is R-A-A-A-P. Yes, OK. You didn’t know that, and there’s no reason why you should. And this is the problem in terms of the findability. I happen to know that that’s quite a useful acronym, because it’s a project I did, and that’s why I’m using it as an example– Research Administration as a Profession– so you can find out about the profession of research administration. I’ve confused you already, because research administration is what we call ourselves in America, whereas we’re research managers and administrators in the UK and Europe.
[00:13:40.21] So you then find some– oh, a little poster and stuff like that. So you can then click in, and you can find some data. And you go, that’s marvelous. Excellent. I’ve got some data, which will tell me about research managers and administrators, 2,691 responses about what we are and what we do. Excellent. That’s exactly what you want. And there it is in SPSS, and you go, oh, great. SPSS. I don’t have an SPSS license, or do I? Do I need to install SPSS in order to be able to see this? You can see it on Figshare. You can download that file, but then you can’t do much with it, unless, of course, the kind researcher has put in– sorry, yes. I’ve skipped over a slide there.
[00:14:15.69] Also, looking back at the data, you see things like– oh, have I got a pointer here? No– current role level. Current role level– it’s either one or two or three. What does that mean? Nobody knows. But of course, a good researcher has given you a code book to explain what all these things are, so you can then translate that. Sounds like a bit more work.
[00:14:35.19] You then find out what all these things mean, so one means I’m a leader person. Two means I’m a manager. Three means I’m operational. So you can look at the differences between people as they move through their career of research management and administration. So that was really exciting. And you sort of have the definition of those variables as well.
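The code book step described here amounts to a lookup from stored codes to human-readable labels. A minimal sketch, assuming the three codes and meanings mentioned in the talk; the variable name `current_role_level`, the exact label wording, and the sample records are hypothetical and are not taken from the actual RAAAP dataset.

```python
from collections import Counter

# Code book for the "current role level" variable, as described in the talk.
ROLE_LEVEL_CODEBOOK = {1: "Leader", 2: "Manager", 3: "Operational"}


def decode(records, variable, codebook):
    """Return copies of the records with coded values replaced by labels.

    Codes missing from the code book are kept and flagged, not silently dropped.
    """
    decoded = []
    for record in records:
        code = record.get(variable)
        label = codebook.get(code, f"Unknown code: {code!r}")
        decoded.append({**record, variable: label})
    return decoded


if __name__ == "__main__":
    # Made-up responses, for illustration only.
    sample = [
        {"respondent": "a", "current_role_level": 1},
        {"respondent": "b", "current_role_level": 3},
        {"respondent": "c", "current_role_level": 3},
    ]
    labelled = decode(sample, "current_role_level", ROLE_LEVEL_CODEBOOK)
    print(Counter(row["current_role_level"] for row in labelled))
```

Publishing the mapping alongside the data in a machine-readable form, rather than only in a separate document, is one way to lower the effort a reuser has to put in.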
[00:14:53.50] So given that there were a few steps in there that you probably wouldn’t have been able to do yourself, how findable and accessible was that information, even though it’s been explicitly put there for people to find? The answer is, not as good as it could be. It was a lot of work for me to put it up there. If you try and find that, even having listened to this presentation, unless you remember the five-letter acronym, then you’re going to be in trouble. So there is a lot more work needed in terms of that sort of cleverer findability sort of stuff, particularly for areas where there are not– it’s a new area, and you don’t have those sort of semantic schemes of what things mean.
[00:15:37.44] So what do we do about enabling this? Well, we’re going to– we’ve heard a little bit about the technology. We’re going to hear, I’m sure, about the linkage between scholarly output and data, so the data underlying the publication. But I think we heard yesterday from Steven, the sort of– the unregulated or unleague-tabled actors are the funders.
[00:16:05.03] And they’re the ones that can really make a difference. So we now have lots of different funders requiring us to have research data management plans and requiring us to deposit the data and make it openly available. And lots of publishers also have that as part of their general criteria as well.
[00:16:21.74] In the UK, we are extraordinarily lucky– I’m sure somebody will quote me now– to have the impact agenda, where it makes it really important for the data to be available, because then it’s more likely used by somebody, if they can find it and understand it, and for that data to then go on and make a difference in the real world.
[00:16:42.05] I mentioned publishers, journals. Also, institutions may well have their policies to help support researchers, so there are quite often now research data management professionals sitting in the library or elsewhere at an institution to help researchers with that bit that they really don’t want to do, because they want to get on with their next bit of research.
[00:16:59.79] And it all comes down to this sort of research culture and that public good, which is, yes, of course, I want to make my data available, but I also want to get on to the next piece of work. And also, for me, as an academic, he says, not being an academic, but for me as an academic, why should I make this data openly available, easy to find, rather than write that next publication? Because it’s the next publication that’s going to help me with my promotion, not the data, until we change the institutional culture. And I think that’s probably all I wanted to say, so I’ll leave it at that.
[00:17:31.32] [APPLAUSE] [00:17:40.55] GRACE BAYNES: Save. Good morning or just about good afternoon, everybody. My name is Grace Baynes. I work for Springer Nature. We are a publisher. We’re an academic, education, and professional publisher. We’re also incredibly committed to open research, and my focus is on how we make the data part of open research happen.
[00:18:03.78] So are data really the new oil? You’ll notice the plural. I should go back, actually. You’ll notice the plural, are. I’ve been taught this by the research data team, who tell me that data are not an is. They’re an are, so– and then I’ve blithely ignored that later on in the presentation.
[00:18:23.07] So does or do data mean dollars for society? Yes, basically. So classic example– we don’t have enough evidence for this, but the classic example is the Human Genome Project, which, it’s estimated, returned a trillion dollars in value to the US economy in the first decade after that data were made openly available. And that’s almost a 300 to 1 return on investment. Many of us would like to get 300 to 1 return on investment.
[00:19:02.10] Looking forward– and this is speculation, and I was very interested, Paul, in what you were saying about not being convinced of the value, and it still needs more demonstration. I totally agree with you. So there are estimates that the European Open Science Cloud, by expanding to include Copernicus Earth Observation data, could, by 2030, generate $30 billion in added benefit to the economy and 50,000 jobs.
[00:19:30.23] There is some precedent for this in that open government data, we know, has genuinely led to innovation, led to the creation of new jobs, led to small and medium enterprises in particular being able to capitalize on that and form very valuable businesses out of this. So what about for the research community or research communities? Does data mean dollars for them in terms of benefit, in terms of cost? The cost is really important and poorly understood and hidden.
[00:20:10.83] Reproducibility is often used as a reason why good research data practice makes sense. Nature did a survey back in 2015, and they asked researchers about their attempts to reproduce work. Unsurprisingly, they found that 70% of researchers couldn’t reproduce other people’s work. Reproducing some of this is really hard.
[00:20:35.23] What was shocking to me and remains shocking to this day is that 50% of them couldn’t reproduce their own work. And if you think about that for a second, and you think about– I hope that your filing system for your documents is better than mine and your naming system is better than mine. But I quite often can’t find something six months later. And I’m talking about collaboration, which is one of the most important reasons to share data.
[00:21:00.76] I read a comment piece from a researcher, who said that the most important collaborator that you need to think about when making your data understandable, findable, accessible, independently understandable on its own is yourself six months from now, because you are not going to be replying to emails to yourself six months from now. Irreproducible research really costs money, so there is an estimate that $28 billion a year, US dollars a year, is spent on research that cannot be reproduced.
[00:21:40.52] And the European Commission has recently commissioned a really interesting study from PricewaterhouseCoopers, which they will tell you is extrapolation from some known things that says that the cost savings to the European Union could be as much as $10 billion a year by better management of data. And that’s largely to do with reducing duplication of efforts and inefficiency.
[00:22:13.28] So what about for researchers? Does it mean dollars for them? Well, and I’m not going to cover this properly, because I don’t have good enough data, and we don’t have time. But the costs for managing research data properly by researchers are actually much more significant than we think.
[00:22:32.45] So I’ve been in two workshops recently with researchers, where, consistently, a number of research PIs, principal investigators of labs have told us that they spend as much as 20% of their research grants on the cost of either finding or managing data. Sadly, that’s all too often taking analog data and translating it into well-structured data.
[00:22:59.58] But are there benefits? Are there benefits for the researcher themselves? Yes. So good data practice has been shown, in a very large study of National Institutes of Health and National Science Foundation studies, 7,000 studies, to up to double the publication outputs of a research study. And research data articles with data openly available and linked to from those articles are associated with a citation advantage of up to 50%.
[00:23:35.04] And Springer Nature has just worked on a large-scale study with the Turing Institute, where we looked at half a million BMC and PLOS articles, and we confirmed that lower-end figure of a correlation of about 25%. So if you’re a publisher, that also means there are advantages to you and your journals for making data openly available.
[00:24:02.70] So– but does data mean dollars for publishers and for journals? So we’ve been doing a lot of work over the last few years to really understand the challenges for researchers and how we can help. And we’ve come up with five essential factors. All of this data are in Figshare, and the white papers are in Figshare as well. So it’s all openly available. So we have a responsibility. As publishers, we have a responsibility. The number one motivator for researchers in sharing their data is increasing the impact and visibility of their research, and that’s what we’re all about in terms of being publishers, but actually, policies make a difference to researchers as well.
[00:24:46.39] Don’t start from scratch. The Research Data Alliance has published a model framework, which is a collaboration of many publishers, funders, and institutions. You don’t have to create this from scratch, and there’s lots of resources like this available to you. Credit– we’ve had that before already. This is a key problem for getting researchers to spend the time and effort making their data easier to find and access. They do not have sufficient credit for doing so, but they are motivated by citations.
[00:25:20.02] So there are things that we can do. And again, like data citation is really important. Follow the FORCE11 principles. Don’t feel that you have to create a new way of doing it. Is there money in data publishing? Potentially. The two largest data journals in the world at the moment are Scientific Data, which we publish, and Data in Brief. Together, we publish about 2,000 articles a year, so of the over a million articles that are published in scholarly literature every year, we’re publishing a tiny percentage. So potentially, but not yet.
[00:25:54.52] Researchers need help, really need help, especially with organizing data and metadata. Metadata is very unsexy until you know what it is and what it can do, and then it becomes really sexy. And they also need help and support. They really don’t feel they need– they get enough help and support. And of the organizations and groups that they look to, they look to publishers. And they will also be looking to you if you’re a society.
[00:26:22.52] Lastly, I just wanted to mention that– that you could look at this and see the need and think, well, it’s easy. We’ll just help them do it. And we can turn that into a business. And I hope Mark’s going to talk about the success that Figshare has had. And we, like many publishers, partner with Figshare to help researchers make their supplementary information more accessible.
[00:26:44.72] But we also took our experience of helping researchers curate their data and their metadata from Scientific Data, and based on what we’ve learned about what they wanted, we developed a service called Research Data Support, where we actually help researchers create a really rich metadata record, link it to their article, and make– you know, and so, therefore, both objects are citable and linked together.
[00:27:07.43] And we’ve been very fortunate to partner with organizations like Wellcome, who have supported this for researchers at no cost to the individual researcher. Researchers should be biting our hands off to help– let us help do this, and they are not, because there is not enough incentive for them to spend the time doing it. So are data really the new oil? I really hope so, but not yet. Thank you.
[00:27:32.02] [APPLAUSE] [00:27:39.99] MARK HAHNEL: So everyone set me up nicely, so we’re going to finish on time for lunch. And we might even have time for a question, if anybody has any. I’m going to TLDR a lot of what everybody said there in my talk. But just to get us started, I want to ask two questions. So put your hands up if you think peer review is important for the academic process. Pretty much everybody. Put your hands up if you think that peer review is free. It doesn’t cost anything to publishers, right? OK, so everybody thinks there’s a cost to it. Bright audience. OK.
[00:28:18.85] So for those who don’t know Figshare, I’m going to– oh, my title is Data Are the New Papers, and I’ll explain what that means and where the dollars fit in. We provide data infrastructure, or we started off providing data infrastructure for different groups who wanted to deal with this: you’re being mandated to make your data publicly available. How are you going to cope with that? And so we’re really good at that.
[00:28:42.16] And here’s four examples of four data infrastructures that we do, and you can see how publishing data is blurred. We heard it’s digital objects, right? So NIH– they’re saying, just publish all of the data sets. And we want some level of checking. That’s important. Royal Society– here’s all the data sets associated with the Royal Society’s Open Biology.
[00:29:05.32] The far one is University Data Repository from Carnegie Mellon, and they said, this is great. Can we also use it for our paper repository, because it handles all of those files as well, as well as it handles any other file. And then the middle one, the one second from the left is ChemRxiv, which is a pre-print server, because they say, because you can handle papers, and you’ve got some basic level of checking going on.
[00:29:28.09] So then I started thinking, what is the difference between data and papers and how we think of it and where the money comes in? And the push has all been from– I used to show this slide back in the early days of Figshare of 2011, 2012. And I said, there’s a tidal wave coming of information. Researchers are going to have to make their data available. And when I spoke to funders, they said, by 2020, we’ll be– these were in twos– recommended open access to papers, recommended open access to data, mandated open access to papers and data, and enforced mandates.
[00:30:01.66] And they said, by 2020, we’ll be at number four. And pretty much, it’s true. It happens everywhere now– the European Commission, the Wellcome Trust, folks like that. And then in January this year, Donald Trump signed in that all federally funded research in America– NIH, NSF, NASA– are going to have to make their data available, and the mandates are just rolling out, but that tidal wave is coming. So everyone’s going to have to do it, right?
[00:30:27.16] And so if we compare what the difference is to data and publications, this is from a paper back from Jason Priem and a few other people, saying, what is a paper? He was trying to say that you could take everything that a publisher does, and Twitter will fix it. And I’ll reiterate some of that. So he said, this is really what you need. It’s storage, publishing, DOI, preparation, search, marketing, and assessment. And I’m going to look at this compared to what our infrastructure does. There are other data infrastructure providers, but I know our own best.
[00:30:59.79] So I can say that we handle that. We handle that fine. And if you look at it, what that means, these are what our outputs look like. So they’ve got DOIs. We’re tracking metrics. We do visualization of these different file formats. So if you look at identity, yep, we have DOIs for everything. Everything that gets published gets a DOI, and for all the clients we work with, we’re a DOI minting service. So if you are Sage, you can have Sage branded DOIs. It’s got your own brand to it now for data sets.
[00:31:28.98] And you can have it on your own domain, so it’s complete branding on that level. In terms of publishing, we now have Google Dataset Search, because what does publishing mean? It means I can look for it and find it somewhere, right? So this is early days of publishing datasets, but in reality, I just search Google and hope to find it there.
[00:31:47.52] In terms of storage, every paper on the planet is about 20 terabytes, from what I gather from [INAUDIBLE]. And so we’re going to be, like, nearly 200 terabytes of storage this year. So I think we can do the storage side of things. So OK, so what about the other thing? That’s the base layer of publishing. What about the other things?
[00:32:09.12] Marketing and search– I’d argue we can cover some of that as well. Search– we have good search functionality, but it’s also this idea that, where does it get indexed? So it’s not just data. It’s other stuff as well. So if you put a pre-print on there, we’ll make sure it ends up on Google Scholar. If it’s a dataset, we’ll make sure it ends up in all these places. But we’ll also do that stuff around FAIR data, which is human and machine readable metadata formats, feeds, things like that.
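As a rough illustration of the "machine readable metadata" mentioned here, the sketch below builds a schema.org `Dataset` record as JSON-LD, the kind of markup that dataset search engines such as Google Dataset Search can read. It is not Figshare's actual output: the function name, the field selection, and every value in the example (title, DOI, creator, licence) are assumptions made up for illustration.

```python
import json


def dataset_jsonld(title, description, doi, creators, license_url):
    """Build a schema.org Dataset record and return it as a JSON-LD string.

    Hypothetical helper: the choice of fields is an assumption, not any
    repository's real export format.
    """
    record = {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": title,
        "description": description,
        # A DOI is conventionally expressed as a resolvable https://doi.org/ URL.
        "identifier": f"https://doi.org/{doi}",
        "creator": [{"@type": "Person", "name": name} for name in creators],
        "license": license_url,
    }
    return json.dumps(record, indent=2)


# All values below are invented placeholders.
example = dataset_jsonld(
    title="Example survey responses on data sharing attitudes",
    description="Hypothetical dataset used only to show the record layout.",
    doi="10.1234/example.5678",
    creators=["A. Researcher"],
    license_url="https://creativecommons.org/licenses/by/4.0/",
)
print(example)
```

A record like this would typically be embedded in the dataset's landing page so that both people and crawlers can read the same description.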
[00:32:37.33] Marketing– I mean we’re a commercial company. We put out white papers on data visualization in humanities, so we will do some marketing for your research in the same way I think publishers do around some of their papers. And you know, they get picked up. This is someone from the University of Sheffield. His data was in Wired magazine, because it was interesting data that was promoted in a way. And we can track the impact of it using altmetrics, so you can see all of the impact it’s having.
[00:33:06.25] So this brings me to my final thing, which is the preparation and assessment, because so far, everything I’ve talked about is technology. And technology scales really well. Figshare’s like 35, 40 people. Publishers are bigger than that. And that is because if you look at the NIH, there’s two levels of preparation that I want to talk about. And this is where I think it’s interesting and relevant in terms of cost, but also a potential market opportunity for publishers.
[00:33:36.91] FAIR data– Findable, Accessible, Interoperable, Reusable. So as we heard from Simon, it’s really hard to label things. When the NIH, with their data repository for the NIH, they say we want to improve the metadata. This is an example on the left of the metadata they’re talking about. So the file’s called– and the title of the object is Angiogenesis Array. If you Google that, you’re not going to find this. Who Googles Angio– yeah, anyway.
[00:34:03.96] But so this is what it actually should be, so that’s basic light touch checking improvement of metadata. And we think you can get to FA that way– Findable and Accessible. But the idea is we really need to get to FAIR, and so FAIR is that improving the data, checking all of the files, making sure it’s interoperable and reusable for a community, which is what Grace was talking about, and saying that the Wellcome Trust are paying for some of that. So there is a model there, but it’s a higher barrier.
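The "basic light touch checking" of metadata described here could be approximated with a few automated rules. The sketch below is a guess at what such checks might look like, not the NIH's or Figshare's actual process: the record layout, the thresholds, the file name, and the "improved" record are all hypothetical.

```python
import os
import re


def _normalize(text):
    """Lower-case and collapse punctuation so titles and file names compare."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def light_touch_issues(record):
    """Return a list of human-readable metadata problems for one record.

    `record` is an assumed dict with keys: title, files, description, keywords.
    The thresholds (4 words, 50 characters) are arbitrary illustrations.
    """
    issues = []
    title = record.get("title", "").strip()
    if len(title.split()) < 4:
        issues.append("title is too short to be findable by search")
    # A title that merely repeats a file name tells a reader nothing new.
    stems = {_normalize(os.path.splitext(f)[0]) for f in record.get("files", [])}
    if _normalize(title) in stems:
        issues.append("title just repeats a file name")
    if len(record.get("description", "").strip()) < 50:
        issues.append("description is missing or too brief")
    if not record.get("keywords"):
        issues.append("no keywords supplied")
    return issues


# The title comes from the talk; the file name is an invented placeholder.
sparse = {
    "title": "Angiogenesis Array",
    "files": ["angiogenesis_array.xlsx"],
    "description": "",
    "keywords": [],
}
# Entirely hypothetical improved record.
improved = {
    "title": "Angiogenesis protein array measurements from cultured cells",
    "files": ["angiogenesis_array.xlsx"],
    "description": "Hypothetical example: antibody array readout of "
                   "angiogenesis-related proteins in cultured cells.",
    "keywords": ["angiogenesis", "protein array"],
}

for problem in light_touch_issues(sparse):
    print(problem)
```

Checks of this kind only address the Findable and Accessible end of FAIR; judging whether the files themselves are interoperable and reusable still needs a person who knows the field.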
[00:34:33.17] So the thing that I just wanted to– there’s two things I want to say, which is there is money out there. Everyone is going to have to publish all of this research data, so where are they going to publish it? Who’s going to do the checking? And there’s two levels of checking. I think there’s a big opportunity for the $50 on top of your APC. Somebody will make your data files better in terms of metadata, and then there’s the more detailed data publication of next level just pay an APC, and you get that same level of kind of quality control.
[00:35:04.64] And the last thing I just wanted to say is I’d really like to get to the point where data become papers so that the article of the future looks like this in the same way that I read Guardian news blogs, and it looks amazing. I’d like papers to look that way, too. So I think there’s a happy medium. Thank you.
[00:35:22.05] [APPLAUSE] [00:35:29.40] DAN POLLOCK: Thank you to all of our speakers. Absolutely terrific. I hope you’ll agree. We’ve got a couple of minutes left, so do we have a roving microphone? Maybe not. Are there any questions in the last couple of minutes? I know I’m standing between you and maybe lunch, but so OK. I’ll take that as a no. Thank you, once again.
[00:35:50.81] We’ve covered a lot of ground. I think there are two things that came out of here. Number one is we have a long way to go, but the destination is worthwhile. Is it where– you’ve heard that it’s tough to put a value on all this. There’s a lot of cost to it. We had a lesson in grammar. By the way, for grammar trivia, both– data can be both singular and plural. If you are technically involved, data are plural. But in lay terms or colloquial terms, you can talk about your data is good or bad. So apparently, we can all be right on that.
[00:36:26.51] So I mean certainly, the destination is worthwhile. We’ve heard about some terrific things going on. We’ve heard, I think, that there’s some really good facilities out there. There’s some emerging standards that you can engage with. So that’s sort of my first theme. My second theme, I think, is to echo the point about cats. I wholeheartedly agree there should be a cat in every slide.
[00:36:44.05] And actually, my personal philosophy is not merely to use any cat in every presentation that I use, but to use our cat in the presentations that I use. And so on that, because we know the internet is designed for showing photos of cats on, I will leave you with that thought. And thank you, and thank, once again, our panelists for a terrific series of presentations. Thank you.
[00:37:01.70] [APPLAUSE] [00:37:03.50] [MUSIC PLAYING]