8

I'm in my senior year of college, and my professor and I have decided on an independent study where I will analyze and develop some method(s) for screening valid data sets for the 'selection' phase of data mining, where spam bots, dead accounts, and other irrelevant data sources are omitted from machine learning analysis.

The problem I'm running into is that I really want to use a data set that is from the real world, so I could truly discover something meaningful from the data that hasn't been discovered before.

My initial thought was to use the Ashley Madison hack dump from the summer of 2015, but I wanted to consider the ethical implications of using such sensitive data. Alternatively, I was thinking I could manipulate the data before analyzing it to provide some sense of anonymity to the victims of the hack (for instance, replacing all full emails with the first and last character of the name, as well as the @ and suffix).
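A minimal sketch of the email masking described above, in Python. The function name `mask_email` and the `*` filler are illustrative assumptions, not part of any existing tool. Note that keeping the first and last characters plus the full domain only obscures the address; it is not strong anonymization.

```python
def mask_email(email: str, filler: str = "*") -> str:
    """Keep the first and last character of the name, plus '@' and the suffix.

    Hypothetical helper: everything between the first and last character
    of the local part is replaced with `filler`.
    """
    # Split on the last '@' so the domain (suffix) is kept intact.
    local, sep, domain = email.rpartition("@")
    if not sep or not local:
        raise ValueError(f"not an email address: {email!r}")
    if len(local) <= 2:
        # Nothing lies between the first and last character, so nothing is hidden.
        masked = local
    else:
        masked = local[0] + filler * (len(local) - 2) + local[-1]
    return f"{masked}@{domain}"


print(mask_email("john.doe@example.com"))  # j******e@example.com
```

Dropping the filler entirely (so the result is just `je@example.com`) would also avoid revealing the length of the original name.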

My question is NOT whether you think these practices are ethical, but whether there has been some professional work done in the past anyone is aware of that can serve as a model for my current work.

For example, Facebook was caught manipulating the content of its users' Facebook feeds in order to measure the emotional effect of that content on their future posts.

  • There are a lot of publicly available datasets and even more available to academics. While the Ashley Madison set would add a certain excitement to your work, I'm sure you can find a more legal one if you're worried about possible consequences. – Ric Jun 10 '16 at 15:35
  • 1
    Note that the database is not necessarily (probably not at all) illegally obtained by you, it was illegally obtained by the person who posted it online. Your question title suggests you might hack some website and download their DB to use in your research. – einpoklum Jun 10 '16 at 16:11

3 Answers

7

Relating to just the Ashley Madison data, there is a plethora of articles on your exact question. I will summarize the main talking points and provide links at the end.

1. Can I download it?

The data is probably public-domain, but it depends on where you live. In some countries like the US, the data itself is publicly accessible and part of a wider conversation about personal rights to privacy, whilst in others like Canada, it has been explicitly decided that Ashley Madison still holds the copyright and downloading the data is akin to acquiring stolen property (note, AM is owned by Avid Life Media, a Canadian company, and people have said that this decision was to offer some degree of protection to ALM, as hacked users cannot use the data dump as evidence in court). Other countries say that fundamentally nothing that can be downloaded can be stolen, so this doesn't even apply.

2. Can I share portions of the data with others?

If you are in a jurisdiction that allows you to download it, you can by default share it - and many (perfectly legal) sites exist for just that. You type your name or e-mail address in to see if you are part of the hack (or your spouse was). For better or for worse, there is no crime for making public-domain data easier to access.

3. Can I process the data and share summary statistics?

Oddly, unlike the above 2 issues, this is legal in every country which offers protection for journalists and researchers - including Canada. Many researchers, particularly those researching infidelity, have asked lawyers this question, and they all say the same thing - yes it's legal. In fact, top US lawyers have gone as far as to say that journalists are probably in the clear if they publish a list of names of celebrities who appear in the hack, for whatever little public good/interest there might be there.

Many articles also point out that there is a big difference between the legality of this and the ethics of this. Both the law of the land and what is considered ethical behaviour change over time, and they don't always have to be in sync. Some say that in using the data you are condoning and even encouraging the hack - which may lead to more hacking/data dumps in the future. Others say that your research may be the only grain of good to come out of the whole debacle.

I will summarize by saying what I would personally do, which is use the data, get the outcomes, then weigh up the pros and cons of those outcomes against the example you will be setting for others in using this data. Legality is really not the issue here at all, because even if it is illegal, you are incredibly unlikely to find yourself taken to court by either Ashley Madison or hacked users, as both parties would have a very poor case. Those aren't my words, those are the words of Jennifer Granick, a law professor at the Stanford Center for Internet and Society.

So if the real question here is the ethics, then this is something that is ultimately up to you to decide on. There may be all manner of real repercussions from your department - particularly if someone in the department turns up in the database... - but that's a very different question.

http://www.huffingtonpost.com/entry/ashley-madison-hack-creates-ethical-conundrum-for-researchers_us_55e4ac43e4b0b7a96339dfe9

http://www.al.com/news/index.ssf/2015/08/is_it_illegal_to_download_the.html

https://onlinejournalismblog.com/2015/07/20/ashley-madison-ethics-journalism-hacked-documents/

http://fortune.com/2015/08/19/ashley-madison-media/

Finally, there are many stories that people have posted online detailing their experience at the hands of the hack. Some say it wasn't really them, some say it was. Some are outraged, some are just numb. I would suggest reading one or two of the longer blog posts to really get a sense of what this data means to some people. It's more than just a resource. People have committed suicide due to the shame or discrimination they faced as a result - most notably people from the LGBT community - so it's really important to not shy away from that when deciding to proceed, or not, with your research.

Wetlab Walter
  • 1
    What's your basis for saying that, "Downloading copyrighted data is akin to acquiring stolen property"? – einpoklum Jun 10 '16 at 16:09
  • That many people have been found guilty of downloading copyrighted works which they did not legally obtain? If you're setting up a "but in Spain/Switzerland/Cambodia/etc it's legal" trap for me, I refer you to the sentence after the sentence you quoted. – Wetlab Walter Jun 10 '16 at 16:18
  • 1
    Regarding point 3 - there is a difference between how journalists can use the data and how an academic/researcher can use the data. (I am pretty sure that an academic/researcher in the US (and Canada) will need to go through an internal IRB (or IRB-like body) to get approval of their plans for how they will handle and report on this kind of data.) – Carol Jun 10 '16 at 16:18
  • @Carol It's going to be jurisdiction-specific, but in the US at least, there is an exemption for using data that is already public domain "Research involving the collection or study of existing data, documents, records, pathological specimens, or diagnostic specimens, if these sources are publicly available..." - http://www.hhs.gov/ohrp/regulations-and-policy/regulations/45-cfr-46/index.html#46.101 – Wetlab Walter Jun 10 '16 at 16:25
  • @Wetlab_Walter, thanks - aren't those regulations written under the assumption that publicly available databases/records were generated so that they already satisfied the human subject issues? However, I acknowledge that this is way outside my field; my jumpy sensitivity is from listening to discussions within university committees charged with keeping our research misconduct/ethics policies. The main message I got out was: always, in good faith, put it in front of the IRB if it involves human subjects, even if it is to say that it falls under the NIH regulation. – Carol Jun 10 '16 at 16:48
  • @WetlabWalter: Your conclusion does not follow from your premise, unless "akin" is in the sense of "also being illegal in some cases". Also, have many Canadians been found guilty of this? (I'm asking because I honestly don't know). Finally, what about fair use, even in Canada? – einpoklum Jun 10 '16 at 16:59
  • That's a good point Carol - the power handed down from the US government to IRBs (and the exceptions to their jurisdiction) as laid out in that linked document probably was written with the assumption that publicly available data cannot be illegal and thus not ethically ambiguous... however, as you rightly go on to say, that's not for us to decide. The law is the law, even when it's unfair, which is why both of us have in-house ethics committees who say "before you do anything, check with us!". And that's a very good point that I totally overlooked - the OP must check if they also need such approval. – Wetlab Walter Jun 10 '16 at 17:01
  • 4
    I think you're conflating "public domain" and "publicly published"; to be in the public domain is to have no copyright, and just because something is public doesn't mean there is no copyright attached to it. – gsnedders Jun 10 '16 at 18:46
  • On the one side I've got einpoklum criticising me for suggesting that the works are protected under copyright (I know AM has sent out lots of DMCA notices and many think these notices aren't legal, because databases of facts on their own often aren't copyrightable), and I've got Geoffrey on the other side telling me I'm conflating "public" with "public domain", suggesting I've failed to grasp that the works are copyrighted. I can't win :P – Wetlab Walter Jun 10 '16 at 19:51
  • 1
    Very thorough - by far the best answer I see on here. I also agree with your suggestion for modeling the study (analyzing the data, then determining the precedent I would set by releasing it). – Cameron Sonido Jun 15 '16 at 13:19
3

You have no legal right to use this data in your research. Basically what you are describing is gaining unauthorised access to sensitive personal data and using it without the permission of the data holder or the individuals. The fact that some hackers posted the data on the internet makes this unauthorized access really easy, but it doesn't change what you are doing from a legal perspective.

Anonymising the data in some way doesn't change this. There are certain cases where anonymising data makes it legally usable for certain purposes--but not in this case, where you have no right to use the data in any way.

Using this data would put you at risk of prosecution and probably jail time. And, if you plan to do some research and try to actually publish it, I'd say this is a very real risk. Even if you avoid legal consequences, it's quite likely this will affect your ability to publish the work.

Caveat: I'm not a lawyer. Ask one if you want more information.

  • 18
    If you are not a lawyer and don't know what you're talking about, you shouldn't be telling someone that they have no legal right to do something and that they risk jail by doing it. This is worthless advice, which at best would be correct only because you got lucky and guessed right, and in the worst (and quite likely IMO) case would simply be factually incorrect. But maybe you think it's okay to lie and tell someone something is illegal even when it's not, to stop them from doing something that you think is unethical? That would make your answer itself unethical. Not good I'm afraid. – Dan Romik Jun 10 '16 at 15:54
  • 4
    Even if you're not a lawyer - on what basis are you claiming that OP has no right to use such data? Please restate your answer to regard your ethics rather than legality. – einpoklum Jun 10 '16 at 16:06
  • 4
    Surprisingly, one actually can understand law without being a lawyer. Did you know that some people actually become familiar with traffic laws before they start driving? Some focus on tax law and have the audacity to charge for their services even though they aren't lawyers either. – The Nate Jun 10 '16 at 16:15
  • @TheNate, not that I disagree with you entirely, but the tax example is nearly bogus. To offer yourself as a tax advisor in many countries you must be a Certified Public Accountant or Chartered Accountant which requires a minimum of training and examination endorsed by the local government. You can't just read the tax code and hang out a shingle in many jurisdictions. Also, I wouldn't recommend hanging out a shingle giving traffic law advice lest you get charged with practicing law without a license (which is illegal in lots of places). – Bill Barth Jun 10 '16 at 16:20
  • 3
    If @dan1111 thinks they understand the law on this issue, then they should probably cite an example or case law which supports their conclusion. Whether lawyer or non-lawyer. – Daniel R. Collins Jun 10 '16 at 16:24
  • 1
    IANAL, but I think it's pretty well-settled, at least in the United States since Feist, that you do have the right to use illegally-obtained database information if you didn't break any laws to obtain it, absent a few exceptions that aren't applicable here. Whether it's ethical is another question and in other jurisdictions, the rules are different. – David Schwartz Jun 10 '16 at 17:24
  • 1
    @Bill Barth : That was actually the point. The state governments certify non-lawyers to give legal advice. That means they formally recognize a very different test than "are you a lawyer?" as the legal yardstick. (How many police officers are lawyers?) I do agree with Collins that some form of citation is needed, of course. – The Nate Jun 10 '16 at 18:01
  • @DavidSchwartz IANAL as well, but I would be very cautious about relying on Feist - that revolved around the question of whether reusing it was legal, but there was nothing legally dubious about the information or its source. It feels quite a distinct situation. – Andrew is gone Jun 10 '16 at 19:09
  • @Andrew I'm not relying on Feist for that. Feist just removed the last obstacle, which was copyright. So after Feist, all the pieces were in place. Feist itself being just one of them. – David Schwartz Jun 10 '16 at 19:41
1

The answer of @dan1111 covers very well the ethics/legality of using this particular data set. To answer the question: are there standard methods of handling the human subject aspect of the proposed research? (with respect to your instinct that you should anonymize the personally identifiable data in it). I will assume you have a similar public dataset that was legal and that you did get the permissions to use.

The answer is yes. But at least in US universities, what you propose is classed as human subject research because of the personally identifiable information (PII). Your research plan would include strategies on how to protect the human subjects (protecting the personally identifiable data). And it would first be submitted to the university's institutional review board (IRB) before you touched anything. They might expect a plan that ensured that the PII be anonymized even before you get to 'view' the dataset. What you propose might be enough, but they may require more steps or protections in place.

There are several similar questions on Stack Exchange, and the answers mostly jump quickly to pointing out the equivalent of IRB approval: "Ethics of scraping “public” data sources to obtain email addresses", "Sensible measures to ethically use freely available, but personal web-based comments in research?", and "Is it legal/ethical to use data grabbed from a Stack Exchange site in a paper?"

Carol