22

When looking at the paper-is-cited-by-paper binary relation in an undirected manner: Is all research connected to each other or will there be many connected components?

In other words: is there a path from, say, a computer science paper to a chemistry paper by walking along the citation graph? Let's say we are only looking at publications in established, well-known conferences/journals.

knub
  • 2
    It's like playing the Wikipedia game. You go from one article to something really random just by clicking the in-text (anchor) links. If a paper has a citation to one paper that has a lot of citations and so on - by the rules of large numbers you will eventually get to something random. – Memj Sep 08 '15 at 15:45
  • 3
    It is unlikely that all research across all fields forms a single tree. All it takes is one paper which does not share the main tree to establish that and I'd be surprised if there wasn't one such. Then again: what practical difference does that make? What's the real problem you're trying to address? – keshlam Sep 08 '15 at 15:48
  • It may not be appropriate to use the term "good" when defining conferences and journals in formal context. – Ébe Isaac Sep 08 '15 at 15:52
  • 3
    @keshlam: Purely philosophical/interest ;-). – knub Sep 08 '15 at 15:55
  • @ÉbeIsaac: That was done on purpose. I think each researcher knows the "good" (established, well-known) conferences in her/his area. Let's say tier 1 journals/conferences. – knub Sep 08 '15 at 16:00
  • @keshlam It doesn't form any sort of tree or even forest. At the very least, it's a directed acyclic graph, but I'm sure that, in fact, you'll find plenty of papers that cite each other, especially in the modern era of preprint servers. – David Richerby Sep 08 '15 at 18:11
  • 1
    There are many papers that cite each other, but that isn't what was asked. All it takes is one instance which does not point to the rest of the universe to answer the posed question, and the odds of that not existing are nil. – keshlam Sep 08 '15 at 20:02
  • 1
    @DavidRicherby There's usually a few papers that annoyingly don't follow the partial ordering (ie. they cite papers in the future), so even directed acyclic isn't a safe bet. – Peter Bloem Sep 09 '15 at 00:21
  • 3
    @keshlam You seem to have missed the point I was making. You said, "It is unlikely that all research across all fields forms a single tree." That's not a matter of probability: it's a matter of clear fact because there is nothing even remotely tree-like about the citation graph. The situation where A cites B and C, and B and C both cite D is extremely common. That is not a tree. – David Richerby Sep 09 '15 at 06:28
  • 2
    @keshlam: The question explicitly says "undirected", which already implies no-one's looking for a tree. As such, all it takes is one instance which neither cites anything else, nor is cited by anything else. While this might exist, it might be difficult to identify it while still establishing it to belong to the considered class of documents in the first place. – O. R. Mapper Sep 09 '15 at 08:27
  • 2
    Let's say we are only looking at publications in "good" conferences/journals — Oh, no, please, don't ruin a perfectly good question. – JeffE Sep 09 '15 at 11:56
  • @JeffE: I changed the wording to "established, well-known". Do you think that is better? I think this is equally vague, isn't it? – knub Sep 09 '15 at 12:04
  • 2
    Nope. A citation, even from an arXiv preprint, an economics working paper, or an otherwise unpublished PhD thesis, is still a citation. – JeffE Sep 09 '15 at 12:11
  • 1
    My next research project will be "six degrees of citation" I guess... – PlasmaHH Sep 09 '15 at 12:58
  • @O.R.Mapper An undirected tree is still a tree... (If you look at Wikipedia they even take the point of view that trees are undirected by definition!) –  Sep 09 '15 at 15:36
  • @NajibIdrissi: Unless we are not dealing with a tree to start with, but with a (non-tree-shaped) graph. Who said we are looking at a tree? Yet, "undirected tree" - just as well, then "one paper that doesn't share the main tree" can be reached as soon as the branches of its tree touch those of the main tree, as movement is not restricted to one direction. – O. R. Mapper Sep 09 '15 at 17:53
  • 1
    @JeffE: I think your suggestion of adding anything up to an "unpublished PhD thesis" makes an already hard-to-answer question entirely unanswerable. I still do not understand why restricting the question to some document set "ruins" the question. – knub Sep 09 '15 at 18:20
  • 1
    @knub Not answering for JeffE, but for me, because restricting certain things for, essentially, convenience, moves this from a question about how citations actually work to a musing on a particular and artificially restricted graph (this is also the root of my issue with your strict definition of clustering). I'd propose that "If I take a citation graph and alter it in some very fundamental ways, what does it look like?" is headed toward off-topic for a question about how Academia actually works. – Fomite Sep 09 '15 at 18:47
  • @Fomite: If we allow everything, is that how "citations" actually work? My general issue with dropping a restriction to peer-reviewed papers or something similar is that the "natural limit" of citations is removed. Only so many papers get cited by other papers because there is a desire (if not a requirement) for brevity, and you have to pick what you cite by relevance. At the same time, the number of papers on a topic is restricted in the first place, as each paper needs to present a novelty. Lastly, the attention dedicated, and thus the number of papers written, is somewhat linked ... – O. R. Mapper Sep 09 '15 at 19:00
  • ... (maybe even roughly proportional sometimes?) to the degree of interest a research topic attracts. Superficially speaking, none of these restrictions are true for documents like Bachelor theses. There is often no page restriction, so everything that fits somehow can be cited. There is no requirement of containing new content or, especially, new insights. And the number of Bachelor theses is strongly dependent on whether a researcher knowledgeable on the respective topic happens to spend a lot of effort with offering/supervising Bachelor theses. Thus, the distortion might work the other way. – O. R. Mapper Sep 09 '15 at 19:03
  • 1
    @O.R.Mapper I would assert that it is. I have cited arXiv pre-prints, white papers and unpublished dissertations over my career, where it was necessary to do so. And I think "they can cite a lot of things" is probably one of the weakest reasons to consider excluding grey literature. – Fomite Sep 09 '15 at 19:40
  • 1
    @Fomite: In that case, is that really what we want to know? The question "Is all research connected via citations, if we count all mentions and references in any resources?" (we might choose to count Google Scholar, as well as the complete catalogues of all publishers, as valid connecting resources, for instance) sounds rather uninteresting to me, compared to "Is all research connected via citations, if we only count a very specific set of citations?" Looking at the text above, the OP has already answered which question they are asking. – O. R. Mapper Sep 09 '15 at 19:55
  • The global graph of citations (a directed acyclic graph) need not be a connected graph. Although paths do exist from one discipline to another, and by philosophy all matter in this world is connected, clusters do form on a general basis. – Ébe Isaac Sep 08 '15 at 16:09
  • Pólya wrote an influential paper on combinatorics with applications to chemistry ("Kombinatorische Anzahlbestimmungen für Gruppen, Graphen und chemische Verbindungen", Acta Mathematica 68:1 (1937), pp 145-254), and that very same paper is routinely cited in works on computer science (combinatorics and analysis of algorithms, in particular, like Flajolet and Sedgewick's "Analytic Combinatorics", Cambridge University Press 2009). Not at all farfetched. – vonbrand Sep 08 '15 at 22:40
  • 1
    @O.R.Mapper And I'd argue that "Is all research connected via citations, if we only count a very specific set of citations?" feels a lot like "I need a network, but I don't like the network I picked, so I'm going to change it until the problem is easier." Frankly, I think how grey literature cites and gets cited (which will vary wildly by field in some cases based on things like the acceptability of arXiv and the prevalence of sandwich theses vs. traditional theses) would be fascinating. – Fomite Sep 10 '15 at 05:00
  • 1
    @O.R.Mapper Yes, that is how citations actually work, and moreover how they should work. My journal papers include citations to arXiv preprints, unpublished technical reports, unpublished PhD theses, Usenet newsgroup posts, blog posts, and StackExchange answers, because those were the correct primary references. Conversely, preprints, technical reports, blog posts, and StackExchange posts that I have written have been cited in multiple journal articles because those were the correct primary references. – JeffE Sep 10 '15 at 21:25
  • @JeffE: I think the issue here is that no-one has defined what is meant by "how citations work" in the context of this question. As was stated in another comment, if Stack Exchange posts are included, as soon as someone posts an answer with an example of two absolutely disconnected papers, they will not be disconnected any more by virtue of that very answer. This is certainly one interpretation of how citations work, but it is one by which the question at hand ... – O. R. Mapper Sep 10 '15 at 21:39
  • ... becomes, in my opinion, quite uninteresting. In my opinion, a much more interesting question is whether the set of papers is still connected if we restrict the considered citation links to a specific subset of all existing citations. And, based on the OP's statements, that seems to be the question the OP intended to ask here. I personally do not find the question of connectedness across any kind of citation very interesting, but others might - yet, it is a different question to this one, and as such, it might be worthwhile to actually ask it as a related, but separate Academia SE question. – O. R. Mapper Sep 10 '15 at 21:39

5 Answers

18

When looking at the paper-is-cited-by-paper binary relation in an undirected manner: Is all research connected to each other or will there be clusters?

First, it's possible to have an unconnected paper. Unlikely, but possible, to write something that has no citations and is never cited. This is clearly an edge case, but you did say all.

Second, all research can connect, but there can still be clusters. The two are not mutually exclusive - fields will cluster, presumably, but there will then be some links between fields. My suspicion, from personal experience, is that these papers will be methodological in focus primarily.

In other words: Is there a way from a, say, computer science paper to a chemistry paper by walking along the citation graph?

Third, yes, you could get from a CS paper to a Chemistry paper. I know this because I can trace that tree in my own work, and the work my work cites. The path isn't even all that convoluted or exciting.

Fomite
  • 1
    I understand cluster in the strict sense, i.e. a cluster consists of everything that is connected. In that case, the two are mutually exclusive. – knub Sep 08 '15 at 20:23
  • 1
    @knub in the strict sense, that's a clique, not a cluster. – OrangeDog Sep 09 '15 at 10:42
  • 2
    @OrangeDog What knub describes is a connected component. A clique would be a subset of papers that all cite each other directly, ie. one step. – Peter Bloem Sep 09 '15 at 10:50
  • 1
    A clique is fully connected. The term would be connected component, which I just edited in the question. – knub Sep 09 '15 at 10:51
  • 1
    Doh. Well at least we all agree it's not a cluster. – OrangeDog Sep 09 '15 at 10:54
  • @knub I'd suggest that a connected component is an overly strict definition for actually understanding citation graphs. Let's say there's a field that cites only itself, meeting your definition. Hundreds, nay thousands of papers, all as a connected component. As insular a field as you can get. Then one paper is cited in a literature review of another field as a brief aside. Is that really a meaningful change? For your definition, it is. I'd suggest viewing clustering in the "clustering coefficient" sense, which allows a little more nuance. – Fomite Sep 09 '15 at 17:49
  • @Fomite: I think it is okay if only one paper connects two fields; those are the interesting ones. The "as a brief aside" aspect is really hard to judge, compared to just looking at citations. As the opinion so far is that there is no giant component, I think it is not necessary to increase the requirements now. – knub Sep 09 '15 at 18:32
  • @knub, there probably is a giant component. As you can see in my answer, the proportion of disconnected papers is tiny. So the whole graph is not connected, but there is a giant component. – Peter Bloem Sep 09 '15 at 19:13
  • @Peter: Sorry, I was unclear again. I meant one component connecting everything and nothing else. – knub Sep 09 '15 at 19:17
  • @knub I think given the consensus is that there is a single connected component, asking about clusters within that component remains far more interesting than just "Giant component: Yes/No". But this is headed toward something that should probably go in chat. – Fomite Sep 09 '15 at 19:41
13

Most likely not (but almost). Apart from the specific papers without references that people have already mentioned, we can look at crawls of large collections of papers. Take, for example, this dataset of all papers in the theoretical high-energy physics section of arXiv:

Nodes 27770
Nodes in largest WCC 27400 (0.987)

The largest weakly connected component (WCC) is what you're after: the largest subset of papers that are connected to each other by a path of citations (ignoring direction). While the largest WCC is almost as big as the entire graph, there are papers outside it. Usually, with graphs like this, these form little clusters of their own.
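To make the WCC idea concrete, here is a minimal Python sketch that computes the share of papers in the largest weakly connected component with a union-find structure. The function name `largest_wcc_fraction` and the toy papers and citations are made up for illustration; they are not taken from the dataset above.

```python
from collections import Counter


def largest_wcc_fraction(nodes, citations):
    """Return the share of nodes that lie in the largest weakly connected component."""
    parent = {n: n for n in nodes}

    def find(x):
        # Walk up to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for citing, cited in citations:
        # Direction is ignored: a citation merges the two papers' components.
        root_a, root_b = find(citing), find(cited)
        if root_a != root_b:
            parent[root_a] = root_b

    sizes = Counter(find(n) for n in nodes)
    return max(sizes.values()) / len(nodes)


# Toy data (hypothetical): papers 1-4 form one component, 5-6 a small island.
papers = [1, 2, 3, 4, 5, 6]
cites = [(1, 2), (3, 2), (4, 3), (5, 6)]
fraction = largest_wcc_fraction(papers, cites)
print(round(fraction, 3))
```

On a real crawl, the same function would presumably be fed the full node list and the citing/cited pairs from the edge file; how that file is laid out is an assumption to check against the dataset's own documentation.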

For a more cross-domain dataset, consider the CiteSeer graph: again, a small proportion of papers lies outside the largest WCC.

Now, of course, these datasets don't contain all of academia, and adding more papers would mean connecting some islands to the WCC, but I'd say adding more papers also adds new little islands. No matter what rule you use to decide which papers count and which don't, I think you always end up with disconnected islands.

Of course, if your question is whether any randomly chosen paper in domain A is likely to be connected with a random paper in domain B, the answer is yes. There will be a large WCC encompassing all domains, and a few tiny islands. I've seen visualizations to this effect, but unfortunately I can't find them at the moment.
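Checking whether two specific papers are connected, and how far apart they are, could look roughly like the breadth-first search below. It is only a sketch: `citation_path` and the paper labels (`cs1`, `math1`, and so on) are hypothetical names, and the citations are invented.

```python
from collections import defaultdict, deque


def citation_path(citations, start, goal):
    """Return a shortest undirected citation path from start to goal, or None."""
    neighbours = defaultdict(set)
    for citing, cited in citations:
        neighbours[citing].add(cited)
        neighbours[cited].add(citing)  # ignore citation direction

    previous = {start: None}
    queue = deque([start])
    while queue:
        paper = queue.popleft()
        if paper == goal:
            # Rebuild the path by following the back-pointers.
            path = []
            while paper is not None:
                path.append(paper)
                paper = previous[paper]
            return path[::-1]
        for other in neighbours[paper]:
            if other not in previous:
                previous[other] = paper
                queue.append(other)
    return None  # the two papers are in different components


# Toy data (hypothetical): a maths paper cited from both CS and chemistry.
cites = [("cs1", "math1"), ("chem1", "math1"), ("cs2", "cs1"),
         ("island_a", "island_b")]
print(citation_path(cites, "cs2", "chem1"))
print(citation_path(cites, "cs2", "island_a"))
```

The first call finds a path through the shared maths paper; the second returns `None` because the island is not reachable from the main component.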

Peter Bloem
  • 4
    Interestingly (by the citeseer stats), for any two papers that are both in the WCC, you have an expected shortest path of 6.35 citations. So while there may be disconnected papers, most of academia is pretty well connected. – Peter Bloem Sep 09 '15 at 00:18
  • 2
    Just as a comment, for me the best network graph visualization of the arxiv database is this: http://paperscape.org/. You can clearly see clusters and clusters close to each other resemble very well the connection between research fields. If you look for famous big physicists you will find them connected everywhere. – Santiago Sep 09 '15 at 10:52
  • @Santi The best network graph visualization of the physics subset of the arxiv database, please. – JeffE Sep 09 '15 at 11:59
  • 1
    @JeffE it is actually the whole arxiv, which includes also quantitative finance, quantitative biology, math and computer science. Of course, as you can see, they are just very faint clusters in the lower left corner, because physics dominates the landscape. – Santiago Sep 09 '15 at 12:17
6

In short, no. For starters, and perhaps surprisingly to many folks, a large number of papers in the scholarly literature have neither incoming nor outgoing citations. These obviously won't be connected to anything else. Let's throw these out and look only at the set of papers that do cite journal articles or do receive citations from journal articles. Even then, not all articles are connected by citation relationships.

Depending on which data set one uses, the fraction of papers in the giant component of a citation graph will vary, but in my experience typically 90-95% of papers will be in this giant component and the rest will be singletons or members of small connected components.
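As a rough illustration of that procedure (drop the papers with no citations in either direction, then see how the rest splits into components), here is a small Python sketch. The function name `component_sizes` and the ten toy papers are hypothetical, and the resulting share says nothing about real data.

```python
from collections import defaultdict, deque


def component_sizes(papers, citations):
    """Component sizes, largest first, after dropping papers with no citations at all."""
    neighbours = defaultdict(set)
    for citing, cited in citations:
        neighbours[citing].add(cited)
        neighbours[cited].add(citing)  # undirected view of the citation graph

    # Keep only papers with at least one incoming or outgoing citation.
    linked = [p for p in papers if neighbours.get(p)]

    seen = set()
    sizes = []
    for paper in linked:
        if paper in seen:
            continue
        # Breadth-first search over one component, counting its members.
        seen.add(paper)
        queue = deque([paper])
        size = 0
        while queue:
            current = queue.popleft()
            size += 1
            for other in neighbours[current]:
                if other not in seen:
                    seen.add(other)
                    queue.append(other)
        sizes.append(size)
    return sorted(sizes, reverse=True)


# Toy data (hypothetical): papers 9 and 10 neither cite nor are cited.
papers = list(range(1, 11))
cites = [(1, 2), (2, 3), (4, 3), (5, 1), (6, 5), (7, 8)]
sizes = component_sizes(papers, cites)
giant_share = sizes[0] / sum(sizes)
print(sizes, giant_share)
```

Here the giant component holds six of the eight remaining papers, and the other two form a small component of their own.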

Corvus
  • 6
    Note that the OP explicitly asks for undirected paths. So even if one paper has no citations at all, it will still be connected to the larger academic world if it is itself cited. – Stephan Kolassa Sep 08 '15 at 16:12
  • 2
    I should have been clearer in my answer. I propose throwing out all articles with no outgoing citations and no incoming citations, and then looking to see if the remainder are all members of a single connected component. One will inevitably find that they are not. – Corvus Sep 08 '15 at 16:55
  • 2
    @Corvus Can you give an example of a paper that is not in the giant component and a paper that is in the giant component? – JiK Sep 08 '15 at 19:29
  • 6
    @JiK hey, that is a trap! As soon as he does it, we have both articles connected here in this page and then both going to the giant component. – arivero Sep 09 '15 at 02:29
  • 1
    What you say makes sense and is what I would expect but it seems to be purely based on conjecture. I would expect no path from a paper on the dietary habits of 13th century farmers to one on Newman modularity but that doesn't mean there won't be one. I don't see how this can be answered without actually building the graph and checking. – terdon Sep 09 '15 at 13:55
  • 2
    @arivero: Indeed - this highlights how it makes little sense to examine this question without establishing what set of documents is considered (e.g. only peer-reviewed publications). Otherwise, we could, for instance, argue that a large number of otherwise disconnected papers are indeed connected ... via appearing in the Google Scholar database, which forms one huge document full of outgoing citations. – O. R. Mapper Sep 09 '15 at 14:15
  • @O.R.Mapper: I did establish a set of documents, right? I said papers published to conferences/journal, which inherently includes peer reviews. I further constrained to "good", established conferences/journals to rule out bogus, very low-quality conferences/journals. – knub Sep 09 '15 at 17:26
  • 1
    @knub: Maybe you did, but I was responding to a comment that implied counting something like Stack Exchange postings as well. – O. R. Mapper Sep 09 '15 at 17:59
2

I don't see how this can be answered short of actually generating the relevant graph and analyzing it. All answers here are simply conjecture. Very reasonable conjecture, based on valid assumptions and reaching conclusions that I find very plausible, but conjecture nonetheless.

I would expect most papers to be connected by a (sometimes very long) path, but not all. However, there is no way of demonstrating this unless we have the graph. I'd be interested to check this, by the way. If anyone has an idea of how to get the relevant nodes and edges, I'd be happy to give it a go.

Until then, I'm afraid this question has to remain unanswered.

terdon
-4

I'm sure each paper is connected to all others. The question is how long this citation path from one paper to the other is. This is the same sort of thing as the five-handshakes rule.

I bet one can even prove that all academic texts are interconnected through the works of, for example, Sigmund Freud.

ikashnitsky
  • This is a bet I would be very happy to take! It would not take long to provide a counter example given any particular bibliometric dataset, given that you can find the connected components of a graph in linear time (Hopcroft, J.; Tarjan, R. (1973). "Algorithm 447: efficient algorithms for graph manipulation". Communications of the ACM 16 (6): 372–378.) and that you will never find everything contained in a single giant component. – Corvus Sep 08 '15 at 16:59
  • 2
    @Corvus, then, take it. A cool reference does not mean you are automatically right – ikashnitsky Sep 08 '15 at 17:11
  • 1
    You could in principle publish a completely spurious paper that cites one paper in the giant component plus everything you want to pull into it. Therefore any example Corvus gives is liable to be nobbled by someone who wants it to be wrong, and it is impossible to publish a paper that actually cites an example of each kind of paper ;-) – Steve Jessop Sep 08 '15 at 21:02
  • @stevejessop "A reductionist approach to unifying the citation graph", J. Bibliometrics (2017). I can see it now... – Andrew is gone Sep 09 '15 at 07:43
  • @SteveJessop: I hope my question is not that important, that someone tries to publish such a bogus paper to an established conference, and it gets accepted. – knub Sep 09 '15 at 17:22