
I have been struggling to find an acceptable answer for this question for my purposes.

There are many ways to measure similarity between two organic compounds, some of which are particularly popular in chemoinformatics. The seemingly most popular way is to use molecular fingerprints, which encode the structure/function of various parts of the molecule. This approach seems very good when you're looking for general similarity between molecules. It also has the added benefit of making it fast to compare huge numbers of molecules, as each molecule can be encoded separately (so the encoding step is generally $O(n)$), rather than running an expensive comparison on every pair of molecules (which would be $O(n^2)$).

However, for some purposes, this fingerprinting approach is not very good. I need a function of two molecules with the following properties:

  • If $A$ and $B$ are two molecules, then $f(A, B) \in [0, \infty).$
  • $f(A, A) = 0$
  • $f(A,B) + f(B,C) \geq f(A,C)$

These are similar to the basic requirements for a metric space. (The big omission is that I don't necessarily need $f(A,B) = f(B,A),$ although this certainly couldn't hurt!)

I want such a function because I'm designing a neural network that essentially outputs a molecule, and I want to have some sort of error on the output.

I have played around a bit with Python-RDKit, and its similarity module, but haven't really been able to form a good "error" function from the output.
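For reference, one standard way to turn a fingerprint similarity into a distance is $1 - T$, where $T$ is the Tanimoto coefficient. For binary fingerprints this is the Jaccard distance, which is known to satisfy the triangle inequality. Below is a minimal sketch in plain Python, assuming a fingerprint is given as a set of "on" bit indices. The function name and the toy bit sets are made up for illustration and are not real fingerprints or RDKit calls.

```python
def tanimoto_distance(bits_a, bits_b):
    """Jaccard (1 - Tanimoto) distance between two fingerprints.

    Each fingerprint is assumed to be an iterable of 'on' bit indices.
    """
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        # Two empty fingerprints are treated as identical.
        return 0.0
    return 1.0 - len(a & b) / len(union)


# Hypothetical toy fingerprints (arbitrary bit indices, not real molecules).
fp_a = {1, 4, 7, 9, 12}
fp_b = {1, 4, 7, 10}
fp_c = {1, 4, 7}

d_ab = tanimoto_distance(fp_a, fp_b)
d_bc = tanimoto_distance(fp_b, fp_c)
d_ac = tanimoto_distance(fp_a, fp_c)
print(d_ab, d_bc, d_ac)
```

Note that this is a distance between fingerprints, not between molecules: two different molecules with identical fingerprints would come out at distance 0, so on the set of molecules it is only a pseudometric.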


I've also experimented a bit with an algorithm I created that looks for the largest identical connected subgraphs of two molecules, and essentially finds matches for each part of the query molecule. The final output is then how many different parts the molecule needs to be split into to find a match.

For instance, if the "true" molecule is ethylbenzene, while the query molecule is m-xylene, then the algorithm would find that the m-xylene needs to be broken into 3 parts for each part to find a perfect match: a benzyl group missing a hydrogen at the third carbon of the aromatic ring, the methyl group that was formerly attached at the third carbon, and the hydrogen that remains from the benzyl group.

However, this algorithm suffers from several problems:

  1. If the query is a subset of the true answer, then the algorithm will always give the answer as 2 (one part is the subset, and the other part is the hydrogen that caps it - try seeing it with something like query - methane, true - acetic acid). This isn't that big of a concern, as you can simply run the algorithm twice - once comparing the query to the true molecule, and once in reverse. That way, if the two molecules are indeed far apart, it's not hard to see (the superset molecule may need to be broken into many pieces to be identical).

  2. This algorithm is slow, and I don't really see a way to speed it up. It searches for maps from subsets of the query molecule to the true molecule, then gradually builds up from there. It also can't greedily pick one map as the best and grow from it, as picking the wrong direction can easily make the "best" map drastically short. So it has to track all possible maps at the same time, which is slow.


In short, this is a somewhat open-ended question that boils down to:

How can we put a (loose) metric on the set of organic molecules?

M.A.R.
kosyumote
  • Just out of curiosity: What do you want to achieve with this? – Karl Jun 28 '16 at 21:51
  • Short answer: train a neural network to deal with organic molecules (starting with interpreting SMILES, then moving up to interpreting NMR spectra, then maybe if these ideas work, to designing drugs... but this is all far away) – kosyumote Jun 28 '16 at 22:06
  • Longer: originally neural networks could only deal with fixed-length data. Those days are gone, thanks to RNNs. This was the first step toward realizing that neural networks can deal with more complicated inputs. There have also been ideas to allow networks to have access to "external resources", in particular Turing machines, leading to the innovation of the NTM. These networks performed amazingly on certain tasks that are suited to having an external memory. – kosyumote Jun 28 '16 at 22:09
  • This got me thinking: when I draw a chemical structure, I usually use a representation of the molecule to think about it. Thus, I am coupling neural networks with some graph/molecule manipulation machinery and seeing what happens.

    However, even if these ideas don't work out, it seems like a useful idea to have a nice measurement of "distance" between two molecules, in a mathematical sense. Being able to embed the set of organic molecules into some mathematical space would probably help in designing algorithms that search for molecules, provided that the embedding is useful.

    – kosyumote Jun 28 '16 at 22:09
  • You're not the first to have thought of this. ;-) https://en.wikipedia.org/wiki/List_of_computer-assisted_organic_synthesis_software – Karl Jun 28 '16 at 22:13
  • I figured I wasn't :) – kosyumote Jun 28 '16 at 22:15
  • Did you read any half-decent QSAR/QSPR introduction article? – permeakra Jun 28 '16 at 22:22
  • Define half-decent -- or give an example of one? – kosyumote Jun 28 '16 at 22:23
  • Not particularly chemistry related (so posted as comment rather than an answer), but one possibility is to compute the various properties you would potentially be interested in, and then compute the Mahalanobis distance between the two property vectors. That should give you a metric value, but it's not used all that much, probably because of computational expense and the need for a fixed representative reference compound set to derive the covariances from. – R.M. Jun 29 '16 at 18:05
  • I assume that you are familiar with the Tanimoto coefficient? – vapid Jul 01 '16 at 15:01
  • Correct me if I am wrong, but would the point not be that YOU come up with a solution to the problem? I mean, you have a problem that I am sure many have already tried to solve in ways similar to your approach. – Greg Mar 28 '18 at 19:01
  • Just curious, how many descriptors did you try? If you confine/define the search space appropriately, descriptors defining fragments and groups seem to match your criteria pretty well. – Greg Jul 16 '18 at 19:32
  • In answer to the original post: unless I am completely missing the point, I don't get why 'for some purposes, this fingerprinting approach is not very good', as you don't explain what you don't like about fingerprints. I agree with vapid, the Tanimoto coefficient (let's call it $T$) does pretty much what you want. Two identical molecules will give a Tanimoto similarity $T(A,A)=1$, and two molecules with no FP bits in common will give $T(A,B)=0$. So your $f(A,B)$ could simply be $\frac 1 {T(A,B)} - 1$. As for the third condition, not sure, but it should be easy to check. – user6376297 Sep 14 '18 at 17:11

2 Answers


Using graphs to represent molecules, would it be reasonable to compute the distance between graphs A and B?

In a graph representation of a molecule, nodes are atoms and edges are bonds.

Example of an algorithm to compute distance between graphs: http://www.xavierdupre.fr/app/mlstatpy/helpsphinx/c_graph/graph_distance.html
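To make the idea concrete, here is a minimal brute-force sketch of a unit-cost graph edit distance in plain Python. It is not the algorithm from the link above. It assumes heavy-atom graphs given as a list of element symbols plus a set of bonded index pairs, charges 1 for every atom insertion, deletion or relabelling and 1 for every bond insertion or deletion, and ignores bond orders. The function name and input format are hypothetical. Because it tries every atom mapping, it is only practical for very small graphs.

```python
from itertools import permutations


def graph_edit_distance(atoms_a, bonds_a, atoms_b, bonds_b):
    """Unit-cost edit distance between two small labelled graphs.

    atoms_*: list of element symbols, indexed from 0.
    bonds_*: iterable of (i, j) index pairs (undirected, no bond orders).
    """
    n = max(len(atoms_a), len(atoms_b))
    # Pad the smaller graph with dummy atoms (None) so every atom has a partner.
    pad_a = list(atoms_a) + [None] * (n - len(atoms_a))
    pad_b = list(atoms_b) + [None] * (n - len(atoms_b))
    edges_a = {frozenset(e) for e in bonds_a}
    edges_b = {frozenset(e) for e in bonds_b}

    best = None
    for perm in permutations(range(n)):
        # perm[i] is the atom of B that atom i of A is mapped onto.
        node_cost = sum(1 for i in range(n) if pad_a[i] != pad_b[perm[i]])
        mapped = {frozenset(perm[i] for i in e) for e in edges_a}
        # Bonds present in exactly one of the two graphs must be added or removed.
        edge_cost = len(mapped ^ edges_b)
        cost = node_cost + edge_cost
        if best is None or cost < best:
            best = cost
    return best


# Heavy-atom graphs: ethanol (C-C-O), dimethyl ether (C-O-C), ethane (C-C).
ethanol = (["C", "C", "O"], {(0, 1), (1, 2)})
dimethyl_ether = (["C", "O", "C"], {(0, 1), (1, 2)})
ethane = (["C", "C"], {(0, 1)})

print(graph_edit_distance(*ethanol, *dimethyl_ether))
print(graph_edit_distance(*ethane, *ethanol))
```

The loop runs over all $n!$ atom mappings, so the cost grows factorially with the number of atoms.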

diogom
  • I recently employed this approach and I can warmly recommend it. I have to note, though, that you will run into the graph isomorphism problem. You can use a greedy algorithm (does not always give the correct result) or a brute force approach (very expensive) to calculate the distance metric. The greedy algorithm is suitable for large molecules, and the brute force method is computationally feasible for graphs of fewer than 12 nodes. – Ivo Filot Nov 11 '18 at 11:32

I recently did something similar to this. I extracted the Murcko scaffolds of the compound to be tested and of my test set, then compared them to find matches. From the matches, I pulled out the R groups that need to be attached to the scaffold to make the compound, and also calculated the maximum common scaffold of the group with matching scaffolds.

The drawback is that it doesn't produce a number that could serve as a similarity score.

pazchem