I really like the recent work by S. Joshua Swamidass and Pierre Baldi on comparing binary fingerprints [1], [2].
I'll focus on ref [1] here. In this study, Swamidass and Baldi show that a simple comparison of the number of bits that are set can quickly rule out particular compounds having a Tanimoto coefficient of greater than a certain value. To be exact, the maximum possible similarity that two fingerprints can have is min(A,B)/max(A,B) where A and B are the number of bits set in each fingerprint.
In other words, if you are looking for all compounds in a library that have a Tanimoto coefficient with a target compound of greater than 0.7, you probably don't have to consider the majority of the compounds simply on the basis of the number of bits they have set. This simple rule can make this procedure much more efficient.
But the authors don't stop there. They decided to completely nail this question, so they look at other types of similarity measure, and at the bounds on similarity when using a multiple-molecule query. And if that wasn't enough, they then lay on the maths and simulations, and go after equations for the resulting speedup for fingerprints of different lengths and different similarity thresholds.
This paper has got it all. It's comprehensive, thorough, and has a really useful take home message that all cheminformaticians who deal with fingerprints should bear in mind.
References:
[1] Bounds and Algorithms for Fast Exact Searches of Chemical Fingerprints in Linear and Sublinear Time
S. Joshua Swamidass and Pierre Baldi
J. Chem. Inf. Model., 47 (2), 302-317, 2007.
[2] Mathematical Correction for Fingerprint Similarity Measures to Improve Chemical Retrieval
S. Joshua Swamidass and Pierre Baldi
J. Chem. Inf. Model., 47 (3), 952 -964, 2007.
Tuesday, 5 June 2007
Subscribe to:
Posts (Atom)