
While researching a topic area I have come across a number of papers that claim to improve on the state of the art and have been published at respected outlets (e.g. CVPR, ICIP). These papers are often written in a way that obscures some of the details, and the descriptions of their methods can be incomplete. When I contact the authors for more information and ask if they would kindly make their source code available, they stop replying or decline the offer.

Why are computer science researchers reluctant to share their code?

I would have expected that disseminating your source code would have positive effects for the author, e.g., greater recognition and visibility within the community and more citations. What am I missing?

For the future, what are some better ways to approach fellow researchers that will result in greater success at getting a copy of their source code?

D.W.
Stephen Tierney
    An important issue, but could you split it into two questions? (On SE sites there should be one question per one... well, question.) That is, could you make another post out of the second question? – Piotr Migdal May 27 '13 at 09:12
    I considered separating the questions but thought that the 2nd would not stand on its own. – Stephen Tierney May 27 '13 at 09:45
  • You may want to take a look at Collective Mind initiative: http://www.hipeac.net/system/files/grigori.pdf and corresponding publication model http://ctuning.org/cm-journal – Michael Pankov May 28 '13 at 09:06
  • I could post an answer, but it would be something like: how could we change things so that more people will publish their source code? Would that be acceptable, or does it belong in a different question? – Faheem Mitha May 28 '13 at 16:06
  • @FaheemMitha, that sounds like a different question to me -- but a good one. Why don't you post a different question asking how to change things so that more CS researchers will share their source code? – D.W. May 28 '13 at 16:08
  • The fix for this problem is simple, make CS research (and research in general) better funded. The problem is there is no support for the cost of getting research code to a level of quality where maintenance costs are supportable. – Dikran Marsupial May 29 '13 at 08:21
    @DikranMarsupial I'm not convinced that throwing more money at researchers will improve the situation as they have limited time. I don't think that distributing high quality code is the issue, it's just an excuse people use. Releasing bad code is better than releasing none. – Stephen Tierney May 29 '13 at 11:13
  • additional funds will solve the problem as it means that either they can hire somebody to maintain the code or they can be bought out of other duties to release the time to do it themselves. I like releasing code, and have done so in the past, the reason I don't release more is that I don't have the time for maintenance, that I know from experience is necessary. I disagree that bad code is better than none, I wasted several weeks earlier this year trying to get someone else's research code to work, to no avail. Of course the authors were of limited help as they have the same problems I do. – Dikran Marsupial May 29 '13 at 11:57
  • I should add, my wife is quite active in providing code and is driven to distraction by the volume of user requests for help, bug fixes, extensions etc. In my case there is a limited audience for my software, if you are in a field where there is a large audience, the cost of maintenance is far from trivial. – Dikran Marsupial May 29 '13 at 11:59

5 Answers


Why researchers might be reluctant to share their code: In my experience, there are two common reasons why some/many researchers do not share their code.

First, the code may give the researchers an important advantage for follow-on work. It may help them get a step ahead of other researchers and publish follow-on research faster. If the researchers have plans to do follow-on research, keeping their code secret gives them a competitive advantage and helps them avoid getting scooped by someone else. (This may be good, or it may be bad; I'm not taking a position on that.)

Second, a lot of research code is, well, research-quality. The researchers probably thought it was good enough to test the paper's hypotheses, but that's all. It may have many known problems; it may not have any documentation; it might be tricky to use; it might compile on only one platform; and so forth. All of these may make it hard for someone else to use. Or, it may take a bunch of work to explain to someone else how to use the code. Also, the code might be a prototype, but not production-quality. It's not unusual to take shortcuts while coding: shortcuts that don't affect the research results and are fine in the context of a research paper, but that would be unacceptable for deployed production-quality code. Some people are perfectionists and don't like the idea of sharing code with known weaknesses or where they took shortcuts; they don't want to be embarrassed when others see the code.

The second reason is probably the more important one; it is very common.

How to approach researchers: My suggestion is to re-focus your interactions with those researchers. What is your real goal? Your real goal is to understand their algorithms better. So, start from that perspective, and act accordingly. If there are some parts of the paper that are hard to follow or ambiguous, start by reading and re-reading the paper, to see if there are some details you might have missed. Think hard about how to fill in any gaps. Make a serious effort on your own, first.

If you are at a research level, and you've put in a serious effort to understand, and you still don't understand ... email the authors and ask them for clarification on the specific point(s) that you think are unclear. Don't bother authors unnecessarily -- but if you show interest in their work and have a good question, many authors are happy to respond. They're just grateful that someone is reading their papers and interested enough in their work to study their work carefully and ask insightful questions.

But do make sure you are asking good questions. Don't be lazy and ask the authors to clear up something that you could have figured out on your own with more thought. Authors can sense that, and will write you off as a pest, not a valued colleague.

Very important: Please understand that my answer explaining why researchers might not share their code is intended as a descriptive answer, not a prescriptive answer. I am emphatically not making any judgements about whether their reasons are good ones, or whether researchers are right (or wrong) to think this way. I'm not taking a position on whether researchers should share their code or not; I'm just describing how some researchers do behave. What they ought to do is an entirely different ball of wax.

The original poster asked for help understanding why many researchers do not share their code, and that's what I'm responding to. Arguments about whether these reasons are good ones are subjective and off-topic for this question; if you want to have that debate, post a separate question.

And please, I urge you to use some empathy here. Regardless of whether you think researchers are right or wrong not to share their code in these circumstances, please understand that many researchers do have reasons that feel valid and appropriate to them. Try to understand their mindset before reflexively criticizing them. I'm not trying to say that their reasons are necessarily right and good for the field. I'm just saying that, if you want to persuade people to change their practices, it's important to first understand the motivations and structural forces that have shaped their current behavior, before you launch into trying to browbeat them into acting differently.


Appendix: I definitely second Jan Gorzny's recommendation to read the article in SIAM News that he cites. It is informative.

D.W.
    +1 Most of the code is written to a deadline, be it a PhD student trying to finish up or a post doc getting a deliverable done in time. I know many people (myself firmly included) who would be embarrassed to be judged on their code quality rather than the actual research it supports. – ThomasH May 31 '13 at 00:21

Stephen, I have just the same experience as you do, and my explanation is that the benefit/cost ratio is too low.

Packaging a piece of software so that it is usable by another person is difficult - often even more difficult than writing it in the first place. It requires, among other things:

  • writing documentation and installation instructions;
  • making sure the code is runnable on a variety of computers and operating systems (I code on Ubuntu, but you may code on Windows, so I have to get a Windows virtual machine to make sure it works there too);
  • answering maintenance questions of the form "why do I get this and that compilation error when I compile your program on the new version of Ubuntu?" (go figure; maybe the new version of Ubuntu dropped some library required by the code, who knows);
  • taking care of 3rd-party dependencies (my code may work fine, but it depends on some 3rd-party jar file whose author decided to remove it from the web).
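A small part of this burden can be automated. As a minimal illustrative sketch (my addition, not part of the original answer; the file name and script are hypothetical), a few lines of Python can at least record the exact environment an experiment ran in, which covers some of the "installation instructions" and "works on my machine" problems:

```python
# Illustrative sketch: capture the interpreter and OS details an experiment
# was run under, so a reader knows what environment to reproduce instead of
# guessing. Save the output alongside the results, e.g. results/environment.json.
import json
import platform
import sys


def environment_record():
    """Return interpreter and OS details for a reproducibility appendix."""
    return {
        "python": sys.version.split()[0],   # e.g. "3.11.4"
        "platform": platform.platform(),    # OS name, version, kernel
        "machine": platform.machine(),      # e.g. "x86_64"
    }


if __name__ == "__main__":
    print(json.dumps(environment_record(), indent=2))
```

This does not make the code itself any more portable, of course; it only documents the gap, which is often the cheapest first step.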

Additionally, I should be available to answer questions and fix bugs, several years after I graduate, when I already work full-time in another place, and have small kids.

And all this, without getting any special payment or academic credit for all that effort.

One possible solution I recently thought of is to create a new journal, the Journal of Reproducible Computer Science, which would accept only publications whose experiments can be repeated easily. Here are some of my thoughts about such a journal:

Submitted papers must have a detailed reproduction section, with (at least) the following sub-sections:

  • pre-requisites - what systems, 3rd-party software, etc., are required to repeat the experiment;
  • instructions - detailed instructions on how to repeat the experiment;
  • licenses - either an open-source or closed-source license, but one that must allow free usage for research purposes.

The review process requires each of 3 different reviewers, from different backgrounds, to go through this section, using different computers and operating systems.

After the review process, if the paper is accepted for publication, there will be another pre-publication step, which will last for a year. During this step, the paper will be available to all readers, who will have the option to repeat the experiment and contact the author in case there are any problems. Only after this year will the paper be finally published.

This journal will enable researchers to get credit for the difficult and important work of making their code usable to others.

EDIT: I now see that someone already thought about this! https://www.scienceexchange.com/reproducibility

"Science Exchange, PLOS ONE, figshare, and Mendeley have launched the Reproducibility Initiative to address this problem. It’s time to start rewarding the people who take the extra time to do the most careful and reproducible work. Current academic incentives place an emphasis on novelty, which comes at the expense of rigor. Studies submitted to the Initiative join a pool of research, which will be selectively replicated as funding becomes available. The Initiative operates on an opt-in basis because we believe that the scientific consensus on the most robust, as opposed to simply the most cited, work is a valuable signal to help identify high quality reproducible findings that can be reliably built upon to advance scientific understanding."

Erel Segal-Halevi
    I don't think that's a valid answer. You can just document your setup or even just distribute the code. Even if I can't run it, I can learn a lot by just inspecting it. – Spidey May 27 '13 at 13:03
    I also thought this way, but over time I learned that when you release your code to the public, you inevitably have some responsibility over it. If the code isn't well-documented, if it doesn't compile, if it doesn't run - people will hold you responsible, and it might be bad for your reputation. – Erel Segal-Halevi May 27 '13 at 13:15
    I agree that if your code doesn't compile or doesn't run it might be bad for your reputation but I think it's reasonable to turn down requests to change/fix your code if you are simply publishing it. If you're managing an open source project that's different but why would publishing source code require more than simply answering questions (as would any publication)? – earthling May 27 '13 at 14:16
    Submitted papers must have a detailed reproduction section -- a similar idea has been bandied about in the computer architecture community for a few years, but it involves providing a working virtual machine with the code installed and ready. You boot the virtual machine and away you go. – Chris Gregg May 27 '13 at 15:16
    @ErelSegalHalevi It's much worse for your reputation to hide the implementation details. IMHO, it basically means "believe me". That's not how science works. Hiding the code violates the most important principle of doing science: falsifiability. You can't invalidate a work if you don't have access to it. The author can hide behind this black curtain, rejecting any attempt to reproduce/invalidate his paper by saying it's not identical to his method. – Spidey May 27 '13 at 16:28
    @Spidey, Erel's answer is completely accurate. He does describe a mindset that many researchers do have. That mindset might be a good one, or might be bad for the field -- but regardless, what matter is that many researchers do share that mindset, and act accordingly. The original poster asked for an explanation of why many researchers have decided not to share their code; Erel has given an accurate description of why some/many researchers have decided to do so. You can agree or disagree with whether they've made the best choice, but that's not the question here. – D.W. May 28 '13 at 04:49
    @Spidey, I agree that it would be much better for science to have all code published. That's why I suggested a way to encourage authors to publish their code. – Erel Segal-Halevi May 28 '13 at 07:24
    +1 for "Journal of Reproducible Computer Science", which makes for interesting googling, by the way. But as a developer I think your requirements for packing the code are much too strong. Anyone who knows the phrase "bit rot" will know that it's unreasonable to expect someone else to maintain code which was simply created to demonstrate a point. – l0b0 May 28 '13 at 08:14
    There's a lot of research in this area made by G.Fursin : he tries to make the CS reproducible. See presentation at http://www.hipeac.net/system/files/grigori.pdf for example. He also tries to push a new publication model http://ctuning.org/cm-journal – Michael Pankov May 28 '13 at 09:04
  • @D.W. You are 100% right. I meant to say it's not a valid reason not to share. – Spidey May 28 '13 at 13:11
    @ErelSegalHalevi: Your dream of having reproducible papers is very nice. But think about publication time. Currently it takes about 6-18 months to publish a paper. Publishing in that journal may take 2-3 years if non-paid reviewers (usually professors) are needed to read all your documentation, learn the language used, install tools, troubleshoot the code, and run it to check that everything is okay. It sounds more practical to use a cloud instance, load the code, and enable reviewers/readers to access your loaded code rather than doing it all themselves. Doing so increases trust, too. – Espanta May 31 '13 at 11:26
  • @Espanta: One step at a time. First, share the code. Then we go thinking about peer reviewing the code. – Spidey May 31 '13 at 19:46
    @Spidey: "It's much worse (...)" - when only looking at science, yes, but (at least in my place) many PhD candidates have the ultimate goal of leaving academia after their PhD to either take a job in the industry, or start their own company. Once they are there, they will be judged by the quality of any code of theirs that is publicly available, and "experimental prototype style code" is going to reflect extremely badly on their reputation. Hence, I can fully understand if they share code only on request rather than make it downloadable somewhere. (They should react to requests, though.) – O. R. Mapper Dec 05 '14 at 06:56

This article in SIAM News sheds some light on the first question, so it might be worth a look. It argues, for a mathematical audience, why researchers ought to publish their source code, and lists many of the reasons you might hear why researchers do not share their source code. It does so by a clever analogy, one that compares the sharing of mathematical proofs to the sharing of source code. Take a look; it has quite an extensive list of reasons why researchers might prefer not to share their source code (as well as some responses arguing that those reasons are not good ones).

Here's a citation:

Top Ten Reasons To Not Share Your Code (and why you should anyway). Randall J. LeVeque. SIAM News, April 1, 2013.

Rup
Jan Gorzny
    I suggest that you give more information about the article you link to. For example, state the title and perhaps a sentence describing the main idea. Note that links expire, and your answer would be more useful if it provided the information even if the link fails. – JRN May 28 '13 at 02:14

In sharing code, there are several issues:

  • The first issue is copyright: some CS research projects are funded by industrial partners or funding organizations that discourage sharing sensitive information such as algorithms, code, or software when publishing in public venues.

  • Second, there are papers based on certain data (collected from code execution) that unfortunately have been manually modified by the authors. If those authors share the code, catching their mistakes, errors, or modifications becomes very easy, which could lead to the failure of their MS/PhD or research project - an outcome they find undesirable.

  • In CS research, and especially publication, developing code, particularly lengthy, complex code, is a non-trivial task, and in most cases the code is considered a money-making and paper-generating asset. By sharing the code with the public, researchers unveil their methods in great detail, which may diminish their contribution to future research. They may also no longer be the only ones who can build on that particular research and code, and take credit for it. In many cases, master's students pick an algorithm or method, change it slightly, and submit a thesis and paper based on it that may contradict the findings and claims of the original author. Remember Thomas Herndon, the graduate student who criticized the findings of two eminent economists of Harvard University (here is the link). If code in CS were revealed, the consequences could likewise be catastrophic (there might not be too many such cases, but when it happens it will be catastrophic).

  • Code is vital property for most researchers in conducting experiments and research. If you have the code, you can simply play with it and modify it to generate a new set of findings that might be more valuable than the initial ones. If the original author is not included as an author on that new work, they receive no credit.

However, Elsevier recently introduced a new feature based on COLLAGE, called Executable Papers, currently available for the Computers & Graphics journal, through which code and data are made available and researchers can modify the code and input values to play with them.

Hope it helps.


The Hiary
Espanta
    If the codes in CS are revealed the consequences are likely catastrophic. — So you're accusing an entire intellectual discipline of fraud? Really? – JeffE May 27 '13 at 17:23
    @JeffE I wouldn't be so harsh to call it all a fraud, but it would definitely improve the overall quality of research papers. – Spidey May 27 '13 at 18:09
    Sounds like an accusation of fraud to me, or least criminal incompetence. The only reason publishing data/code would be "catastrophic" is if that data/code did not support the published conclusions about that data/code, as they didn't in the Reinhart-Rogoff paper referenced one sentence earlier. – JeffE May 27 '13 at 19:12
    I think your 2nd point is really the most important one. If others can replicate state of the art research and tweak it slightly to produce new publishable results, then you lose the ability to capitalize on all your hard work developing the code to begin with. – Paul May 28 '13 at 01:47
  • you lose the ability to capitalize on all your hard work [citation needed] – JeffE May 28 '13 at 03:10
    I don't get the second point and the latter half of the third point. Since when does CS stand for dishonest junk pseudo-science where authors manipulate data and hide the details because otherwise what they call "results" would be falsified? If it results in a catastrophe for authors to be all honest and make things verifiable, your field should have collapsed already and gone forever. Like JeffE said, you're accusing CS if you're suggesting these are valid answers to OP's question. You must present evidence. Oh, you collected evidence by your code and manipulated it? That's how CS works, huh? – Yuichiro Fujiwara May 28 '13 at 03:14
    "copyright [...] large number of CS researches are funded by certain organization that does not allow people to share their codes" - Which organization are you referring to? I don't know of a single one that prohibits CS researchers to share their software, or any reason that copyright prevents sharing of software. I think this statement is just plain wrong. – D.W. May 28 '13 at 04:46
  • These are all good hypotheses, and the problem isn't confined to CS. Anyone know of a study to determine which is actually at play? (killer nonresponse bias, probably... =) – petrelharp May 28 '13 at 04:48
    @Erel Segal Halevi: I edited the last paragraph and provided a link to the feature. – Espanta May 28 '13 at 06:19
  • @JeffE: I have no intention of making an accusation; it is my own opinion from what I have already observed, and that is why I used "likely" - to protect the good guys and researchers, like you, who do not falsify findings. I further added one more sentence to that. However, if the point hurts you and others, I really apologize. I had no intention of that. Sorry dude :) – Espanta May 28 '13 at 06:22
  • @D.W.: To me, it's logical that industrialists funding projects keep the rights to the algorithm and code. Do you think papers from big CS and communication companies can be replicated easily? Are they available to the public? They may publish the significance of their research, but I think they try not to share valuable assets such as code with the public. Don't forget that competitors are watching each other. Common evidence is a statement on a published paper like "Approved for External Publication". This [http://www.hpl.hp.com/techreports/2011/HPL-2011-55R1.pdf] is one of those. Don't agree? – Espanta May 28 '13 at 06:33
    @Espanta, it's a huge leap from "you think it would be logical if funders prevented researchers from sharing code" to what you actually wrote. Just because you think something would be logical, doesn't mean it is actually so. What you actually wrote in the answer is almost certainly wrong. If you care about accuracy, you will edit your answer to fix what you wrote and remove the claim that "large number of CS researches are funded by certain organization that does not allow people to share their codes". – D.W. May 28 '13 at 07:41
  • @D.W.: Thanks, friend. I made further changes to make it more accurate. Hope it does not have many problems now. – Espanta May 28 '13 at 11:44
    My point of view on this is that if you can't share the code and data, you can't publish. Simple as that. You can still research and sell your workforce to anyone, but what good does it make to tell the science community of your achievement if no one is able to reproduce it? – Spidey May 28 '13 at 13:16
  • You must share data, but not necessarily code. In journals, reviewers are asked to evaluate whether the authors provide sufficient info to replicate the work. Code is not the only thing needed for research replication. There are many other factors that impact the results, and these vary from domain to domain; e.g., for cloud-based apps running on mobile devices, there is a very large number of metrics, like device and cloud type, programming language, mobile-cloud distance, data collection time, network quality, and many more. So replication of work is not easy in all domains; the author's art is to establish trust in the paper. – Espanta May 31 '13 at 11:12

I am not a CS researcher per se, but I am writing Android code for my research in Atmospheric Physics, so my view is somewhat limited. However, I can say from my own experience that much of the code I am developing and testing is part of a greater project that my team is building. Withholding it is a mix of the rules I am bound by and the need to keep a portion of the code under wraps for the time being.

Étienne