Ubiquitous in published literature, chemical sketches[1] combine molecular connection tables with additional drawing elements, free text, and embedded images. Unlike connection table formats that precisely capture chemistry for database entry, the primary purpose of a sketch format is to produce a high quality image for conveying information to other chemists. The presence of additional elements present in a sketch may either extend or change the chemical interpretation leading to incorrect structures if naively exported.
The variability and flexibility of sketching formats presents several challenges including:
– Non-chemical elements used to represent chemical features (e.g. arcs for a bond)
– Chemical elements used for non-chemical features (e.g. ethane for a line, cyclobutane for a table cell)
– Contracted labels that need expanding or correcting (e.g. HOOC-, O2N-, K2CO3)
– Distinction between generic labels and chemical elements (e.g. B, C, D, P, V, W, Y, Ac, Ar)
– Reagent abbreviations (e.g. HATU, COMU, THF, DMSO)
– Compound identifiers (e.g. “(II)”, “(IV)”, “(1a)”)
– Bond interpretation (e.g. wavy or dative bond used for attachment points)
– Free text overlaid on the structure (e.g. “( )n” for a link node)
– Assignment of reaction role
Being able to successfully read, interpret, and understand the chemist’s intention improves the quality and utility of the data for searching and analysis.
Since 2001 the USPTO has redrawn chemical sketches with ChemDraw for all patent applications and grants. Totalling more than 24+ million CDX files (10+ million from grants, 14+ million from applications) this data set provides a large collection in which to identify conventions and rectify systematic problems. By identifying and resolving these problems, we have extracted a large high-quality collection of fragments, molecules, reactions, and schemes. Generic structural features such as frequency, positional, substituent, and homology variation are also captured. The extracted chemical data is summarised and compared to that obtained through text-mining (NextMove Software’s LeadMine), image-to-structure (SureChEMBL), and direct export (SCRIPDB).
[1] Sketch formats include ChemDraw (CDX/CDXML), ACD/ChemSketch (SK2), BioVia Draw (SKC), and MarvinSketch (MRV)
1. US 2015/342913 C00015 This WorkSCHEMBL17244875 / CID 118501484
V denotes cycloalkyl, heterocyclyl, aryl or heteroaryl,
W denotes , heterocyclyl or heteroaryl;
Generic features are recognised and captured in the output format. This allows a large
scale summary of the features over U.S. Patents since 2001. Using the first sketch from the
claims of each patent, 200,900 generic molecules with high/medium confidence can be
extracted. Their feature distribution is:
John May, Daniel Lowe and Roger Sayle
NextMove Software Ltd, Cambridge, UK.
NextMove Software Limited
Innovation Centre (Unit 23)
Cambridge Science Park
Milton Road, Cambridge
UK CB4 0EY
www.nextmovesoftware.com
Introduction
We would like to thank MineSoft Ltd for funding and George Papadatos for providing bulk
downloads of structures from SureChEMBL for the Dec 2015 Patent Applications.
Sketchy Sketches:
Hiding Chemistry in Plain Sight
Sketch formats, such as ChemDraw, are often preferred for publication as they provide
more freedom to the author. This freedom comes at a cost and what is displayed to the
user may not coincide with the stored chemical representation. This can lead to incorrect
structures if naïvely exported. More than 24 million ChemDraw files are available from the
U.S. Patents, using these files a high fidelity converter/interpreter has been developed that
is able to extract and associate more chemistry at higher precision.
Content Categorisation
Acknowledgements
Evaluation
Examples of Improved Interpretation
Using 2,528[3] U.S. Patent Applications published in Dec 2015 the structures extracted by
different methods are compared:
A categorisation scheme has been developed to organise and characterise diverse sketch
content. Sketches are assigned a content type (1), level of detail (2), and a confidence that
the interpretation is correct (3):
Substituent is assigned if the entity contains an attachment point and are commonly found
in Markush claims. NoConnectionTable is assigned to non-chemical sketches and when
an extracted connection table is not believed to be real (e.g. chessboardanes[1]).
1. https://cdsouthan.blogspot.co.uk/2015/07/chessboardane-and-other-strange-patent.html
2. Heifets, A and Jurisica, I. SCRIPDB: a portal for easy access to syntheses, chemicals and
reactions in patents. Nucl. Acids. Res. 2012
3. Papadatos, G et al. SureChEMBL: a large-scale, chemically annotated patent document
database. Nucl. Acids. Res. 2015
4. All U.S. patents applications in Dec 2015 with chemical complex work units.
Generic is assigned when substituent, positional, or frequency variation is present. This
primarily includes Markush cores but also polymers. Specific is assigned when all atom
labels have been resolved. An entity with unresolved labels and no variation are Unknown,
the labels are usually partial, ambiguous, or incorrect condensed formula.
US 2015/370940 C00005
NoConnectionTable
US 2015/0344503 C00668
Molecule/Specific/High
US 2007/129372 C00001
Molecule/Generic/High
US 2015/366987 C00008
Substituent/Specific/High
Substituent/Generic/High
US 2015/368236 C00019US 2015/378257 C00118
Molecule/Generic/High
US 2015/378260 C00003
Molecule/Unknown
US 2015/366987 C00008 SCHEMBL6726025 / CID 69879541 This Work
Attachment Points and Punctuation Striping. Attachment points may be misrepresented
as tert-butyl, iso-propyl, or methyl groups. Punctuation can be used to separate structure
lists, if appended to an atom label that label may not be understood.
Condensed Formula. Formulas are reinterpreted with a more accurate algorithm. Some
labels previously defined will be unrecognised (Molecule/Unknown) whilst others are
corrected.
This WorkUS 2015/376182 C00030
Substituent Labels. Markush definitions are recognised in the patent text and used to
resolve ambiguities. W for Tungsten? V for Vanadium?
US 2015/376182 [Claim 1] Document
Bibliography
This WorkUS 2004/101442 C00025 Molfile Contents
Generic Feature Survey
SCRIPDB[2] and SureChEMBL[3] include structures from directly converting bundled
Molfiles. SureChEMBL also includes structures obtained from raster images using image-
to-structure. The raw data is used in this evaluation, SureChEMBL filters structures before
indexing in the public interface.
Normalisation. Output from CDX and TXT are split into individual components (e.g.
splitting reactions and salts) to make it comparable to IMG and MOL which had already
been split. Only components with ≥12 bonds between heavy atoms are considered.
File Content. The 2,528 patents contain 136,372 ChemDraw files with the categories:
Files can contain multiple entities, the numbers reported here are by file rather than entity. By entity
Molecule/Specific is 106,402, Reaction/Specific is 16,388.
Specific
72,520
Generic
26,743
Unknown
2,045
Specific
9,674
Generic
4,774
Unknown
698
Specific
11,647
Generic
6,653
Unknown
344
Molecule Reaction Substituent
Frequency Variation Positional Variation Substituent Variation
0
20000
40000
60000
80000
100000
120000
140000
160000
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
Number of Features
NumPatents
Overlap. Structures are compared using Canonical SMILES (OEChem) of the individual
components on a per document basis. For CDX only content categorised as specific
molecules and reactions is counted.
Whole Document Claims Section
Differences. The biggest contribution to differences between the CDX, IMG and MOL are
misinterpreted substituents and condensed formulae (see. left panel). Some differences are
a result of human input error.
US 2015/380663 C00381
Conclusion
Interpreting ChemDraw files shows closer agreement with chemistry mined from the patent
text than image-to-structure and much closer agreement than Molfile conversion. Reading
the ChemDraw files, rather than the derived Molfiles, allows extraction of complex concepts
such as reaction schemes and generic structures. The categorisation helps identify
problematic structures and highlights areas to be improved. A major advantage of the
ChemDraw interpretation is speed, taking 5h on a single-core to process the back
catalogue of 24+ million.
SCHEMBL17366912 / CID 118602343 This Work
Unique Component-Document Associations
Detail: Specific|Generic|Unknown
Type: Molecule|Reaction|Substituent|NoConnectionTable
Confidence: High|Medium|Low
1
2
3
Not Found in TXT Also Found in TXT
CDX 49,119 36,829 (42.8%)
IMG 49,836 35,545 (41.6%)
MOL 58,169 28,926 (33.2%)
Not Found in TXT Also Found in TXT
CDX 10,664 800 (6.9%)
MOL 11,078 796 (6.7%)
IMG 11,783 689 (5.5%)
CDX
IMG
MOL
TXT
0
25000
50000
75000
100000
125000
150000
175000
200000
225000
250000
Count (Whole Document)
0
2500
5000
7500
10000
12500
15000
17500
20000
22500
25000
Count (Claims Section)
Generic Mol/Rxn
Specific Mol/Rxn
Substituent
Molecule
101,308
Reaction
15,146
Substituent
18,644
NoConnectionTable
1,274
PATENT
XML
CDX
MOL
TIF
Exported by ChemDraw
Name-to-Structure
(NextMove's LeadMine)
Image-to-Structure
(SureChEMBL,CLiDE)
Molfile Reader
(SCRIPDB,SureChEMBL)
ChemDraw Interpreter
(This Work)
LeadMine
CDX
IMG
MOL
TXT
Patent and Data Bundle Available from USPTO