SlideShare a Scribd company logo
1 of 1
Download to read offline
US 2015/342913 C00015 This WorkSCHEMBL17244875 / CID 118501484
V denotes cycloalkyl, heterocyclyl, aryl or heteroaryl,
W denotes , heterocyclyl or heteroaryl;
Generic features are recognised and captured in the output format. This allows a large
scale summary of the features over U.S. Patents since 2001. Using the first sketch from the
claims of each patent, 200,900 generic molecules with high/medium confidence can be
extracted. Their feature distribution is:
John	May,	Daniel	Lowe	and	Roger	Sayle	
NextMove	Software	Ltd,	Cambridge,	UK.
NextMove	Software	Limited	
Innovation	Centre	(Unit	23)	
Cambridge	Science	Park	
Milton	Road,	Cambridge	
UK		CB4	0EY	
www.nextmovesoftware.com
Introduction
We would like to thank MineSoft Ltd for funding and George Papadatos for providing bulk
downloads of structures from SureChEMBL for the Dec 2015 Patent Applications.
Sketchy Sketches:
Hiding Chemistry in Plain Sight
Sketch formats, such as ChemDraw, are often preferred for publication as they provide
more freedom to the author. This freedom comes at a cost and what is displayed to the
user may not coincide with the stored chemical representation. This can lead to incorrect
structures if naïvely exported. More than 24 million ChemDraw files are available from the
U.S. Patents, using these files a high fidelity converter/interpreter has been developed that
is able to extract and associate more chemistry at higher precision.
Content Categorisation
Acknowledgements
Evaluation
Examples of Improved Interpretation
Using 2,528[3] U.S. Patent Applications published in Dec 2015 the structures extracted by
different methods are compared:
A categorisation scheme has been developed to organise and characterise diverse sketch
content. Sketches are assigned a content type (1), level of detail (2), and a confidence that
the interpretation is correct (3):
Substituent is assigned if the entity contains an attachment point and are commonly found
in Markush claims. NoConnectionTable is assigned to non-chemical sketches and when
an extracted connection table is not believed to be real (e.g. chessboardanes[1]).
1. https://cdsouthan.blogspot.co.uk/2015/07/chessboardane-and-other-strange-patent.html
2. Heifets, A and Jurisica, I. SCRIPDB: a portal for easy access to syntheses, chemicals and
reactions in patents. Nucl. Acids. Res. 2012
3. Papadatos, G et al. SureChEMBL: a large-scale, chemically annotated patent document
database. Nucl. Acids. Res. 2015
4. All U.S. patents applications in Dec 2015 with chemical complex work units.
Generic is assigned when substituent, positional, or frequency variation is present. This
primarily includes Markush cores but also polymers. Specific is assigned when all atom
labels have been resolved. An entity with unresolved labels and no variation are Unknown,
the labels are usually partial, ambiguous, or incorrect condensed formula.
US 2015/370940 C00005
NoConnectionTable
US 2015/0344503 C00668
Molecule/Specific/High
US 2007/129372 C00001
Molecule/Generic/High
US 2015/366987 C00008
Substituent/Specific/High
Substituent/Generic/High
US 2015/368236 C00019US 2015/378257 C00118
Molecule/Generic/High
US 2015/378260 C00003
Molecule/Unknown
US 2015/366987 C00008 SCHEMBL6726025 / CID 69879541 This Work
Attachment Points and Punctuation Striping. Attachment points may be misrepresented
as tert-butyl, iso-propyl, or methyl groups. Punctuation can be used to separate structure
lists, if appended to an atom label that label may not be understood.
Condensed Formula. Formulas are reinterpreted with a more accurate algorithm. Some
labels previously defined will be unrecognised (Molecule/Unknown) whilst others are
corrected.
This WorkUS 2015/376182 C00030
Substituent Labels. Markush definitions are recognised in the patent text and used to
resolve ambiguities. W for Tungsten? V for Vanadium?
US 2015/376182 [Claim 1] Document
Bibliography
This WorkUS 2004/101442 C00025 Molfile Contents
Generic Feature Survey
SCRIPDB[2] and SureChEMBL[3] include structures from directly converting bundled
Molfiles. SureChEMBL also includes structures obtained from raster images using image-
to-structure. The raw data is used in this evaluation, SureChEMBL filters structures before
indexing in the public interface.
Normalisation. Output from CDX and TXT are split into individual components (e.g.
splitting reactions and salts) to make it comparable to IMG and MOL which had already
been split. Only components with ≥12 bonds between heavy atoms are considered.
File Content. The 2,528 patents contain 136,372 ChemDraw files with the categories:
Files can contain multiple entities, the numbers reported here are by file rather than entity. By entity
Molecule/Specific is 106,402, Reaction/Specific is 16,388.
Specific
72,520
Generic
26,743
Unknown
2,045
Specific
9,674
Generic
4,774
Unknown
698
Specific
11,647
Generic
6,653
Unknown
344
Molecule Reaction Substituent
Frequency Variation Positional Variation Substituent Variation
0
20000
40000
60000
80000
100000
120000
140000
160000
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
Number of Features
NumPatents
Overlap. Structures are compared using Canonical SMILES (OEChem) of the individual
components on a per document basis. For CDX only content categorised as specific
molecules and reactions is counted.
Whole Document Claims Section
Differences. The biggest contribution to differences between the CDX, IMG and MOL are
misinterpreted substituents and condensed formulae (see. left panel). Some differences are
a result of human input error.
US 2015/380663 C00381
Conclusion
Interpreting ChemDraw files shows closer agreement with chemistry mined from the patent
text than image-to-structure and much closer agreement than Molfile conversion. Reading
the ChemDraw files, rather than the derived Molfiles, allows extraction of complex concepts
such as reaction schemes and generic structures. The categorisation helps identify
problematic structures and highlights areas to be improved. A major advantage of the
ChemDraw interpretation is speed, taking 5h on a single-core to process the back
catalogue of 24+ million.
SCHEMBL17366912 / CID 118602343 This Work
Unique Component-Document Associations
Detail: Specific|Generic|Unknown
Type: Molecule|Reaction|Substituent|NoConnectionTable
Confidence: High|Medium|Low
1
2
3
Not Found in TXT Also Found in TXT
CDX 49,119 36,829 (42.8%)
IMG 49,836 35,545 (41.6%)
MOL 58,169 28,926 (33.2%)
Not Found in TXT Also Found in TXT
CDX 10,664 800 (6.9%)
MOL 11,078 796 (6.7%)
IMG 11,783 689 (5.5%)
CDX
IMG
MOL
TXT
0
25000
50000
75000
100000
125000
150000
175000
200000
225000
250000
Count (Whole Document)
0
2500
5000
7500
10000
12500
15000
17500
20000
22500
25000
Count (Claims Section)
Generic Mol/Rxn
Specific Mol/Rxn
Substituent
Molecule
101,308
Reaction
15,146
Substituent
18,644
NoConnectionTable
1,274
PATENT
XML
CDX
MOL
TIF
Exported by ChemDraw
Name-to-Structure
(NextMove's LeadMine)
Image-to-Structure
(SureChEMBL,CLiDE)
Molfile Reader
(SCRIPDB,SureChEMBL)
ChemDraw Interpreter
(This Work)
LeadMine
CDX
IMG
MOL
TXT
Patent and Data Bundle Available from USPTO

More Related Content

Viewers also liked

2016 global outsourcing survey infographic
2016 global outsourcing survey infographic2016 global outsourcing survey infographic
2016 global outsourcing survey infographicDeloitte United States
 
Sketchy sketches hiding chemistry in plain sight
Sketchy sketches hiding chemistry in plain sightSketchy sketches hiding chemistry in plain sight
Sketchy sketches hiding chemistry in plain sightNextMove Software
 
Student Project MECH S
Student Project MECH SStudent Project MECH S
Student Project MECH SDalton Goodwin
 
5 Reasons to Support Cybersecurity Information Sharing Act (CISA)
5 Reasons to Support Cybersecurity Information Sharing Act (CISA)5 Reasons to Support Cybersecurity Information Sharing Act (CISA)
5 Reasons to Support Cybersecurity Information Sharing Act (CISA)U.S. Chamber of Commerce
 
Pair Programming demystified
Pair Programming demystifiedPair Programming demystified
Pair Programming demystifiedDaftcode
 
The Future Of Work & The Work Of The Future
The Future Of Work & The Work Of The FutureThe Future Of Work & The Work Of The Future
The Future Of Work & The Work Of The FutureArturo Pelayo
 
Guided Reading: Making the Most of It
Guided Reading: Making the Most of ItGuided Reading: Making the Most of It
Guided Reading: Making the Most of ItJennifer Jones
 

Viewers also liked (8)

2016 global outsourcing survey infographic
2016 global outsourcing survey infographic2016 global outsourcing survey infographic
2016 global outsourcing survey infographic
 
Sketchy sketches hiding chemistry in plain sight
Sketchy sketches hiding chemistry in plain sightSketchy sketches hiding chemistry in plain sight
Sketchy sketches hiding chemistry in plain sight
 
Student Project MECH S
Student Project MECH SStudent Project MECH S
Student Project MECH S
 
5 Reasons to Support Cybersecurity Information Sharing Act (CISA)
5 Reasons to Support Cybersecurity Information Sharing Act (CISA)5 Reasons to Support Cybersecurity Information Sharing Act (CISA)
5 Reasons to Support Cybersecurity Information Sharing Act (CISA)
 
Meetings
MeetingsMeetings
Meetings
 
Pair Programming demystified
Pair Programming demystifiedPair Programming demystified
Pair Programming demystified
 
The Future Of Work & The Work Of The Future
The Future Of Work & The Work Of The FutureThe Future Of Work & The Work Of The Future
The Future Of Work & The Work Of The Future
 
Guided Reading: Making the Most of It
Guided Reading: Making the Most of ItGuided Reading: Making the Most of It
Guided Reading: Making the Most of It
 

More from NextMove Software

CINF 170: Regioselectivity: An application of expert systems and ontologies t...
CINF 170: Regioselectivity: An application of expert systems and ontologies t...CINF 170: Regioselectivity: An application of expert systems and ontologies t...
CINF 170: Regioselectivity: An application of expert systems and ontologies t...NextMove Software
 
Building a bridge between human-readable and machine-readable representations...
Building a bridge between human-readable and machine-readable representations...Building a bridge between human-readable and machine-readable representations...
Building a bridge between human-readable and machine-readable representations...NextMove Software
 
CINF 35: Structure searching for patent information: The need for speed
CINF 35: Structure searching for patent information: The need for speedCINF 35: Structure searching for patent information: The need for speed
CINF 35: Structure searching for patent information: The need for speedNextMove Software
 
A de facto standard or a free-for-all? A benchmark for reading SMILES
A de facto standard or a free-for-all? A benchmark for reading SMILESA de facto standard or a free-for-all? A benchmark for reading SMILES
A de facto standard or a free-for-all? A benchmark for reading SMILESNextMove Software
 
Recent Advances in Chemical & Biological Search Systems: Evolution vs Revolution
Recent Advances in Chemical & Biological Search Systems: Evolution vs RevolutionRecent Advances in Chemical & Biological Search Systems: Evolution vs Revolution
Recent Advances in Chemical & Biological Search Systems: Evolution vs RevolutionNextMove Software
 
Can we agree on the structure represented by a SMILES string? A benchmark dat...
Can we agree on the structure represented by a SMILES string? A benchmark dat...Can we agree on the structure represented by a SMILES string? A benchmark dat...
Can we agree on the structure represented by a SMILES string? A benchmark dat...NextMove Software
 
Comparing Cahn-Ingold-Prelog Rule Implementations
Comparing Cahn-Ingold-Prelog Rule ImplementationsComparing Cahn-Ingold-Prelog Rule Implementations
Comparing Cahn-Ingold-Prelog Rule ImplementationsNextMove Software
 
Eugene Garfield: the father of chemical text mining and artificial intelligen...
Eugene Garfield: the father of chemical text mining and artificial intelligen...Eugene Garfield: the father of chemical text mining and artificial intelligen...
Eugene Garfield: the father of chemical text mining and artificial intelligen...NextMove Software
 
Chemical similarity using multi-terabyte graph databases: 68 billion nodes an...
Chemical similarity using multi-terabyte graph databases: 68 billion nodes an...Chemical similarity using multi-terabyte graph databases: 68 billion nodes an...
Chemical similarity using multi-terabyte graph databases: 68 billion nodes an...NextMove Software
 
Recent improvements to the RDKit
Recent improvements to the RDKitRecent improvements to the RDKit
Recent improvements to the RDKitNextMove Software
 
Pharmaceutical industry best practices in lessons learned: ELN implementation...
Pharmaceutical industry best practices in lessons learned: ELN implementation...Pharmaceutical industry best practices in lessons learned: ELN implementation...
Pharmaceutical industry best practices in lessons learned: ELN implementation...NextMove Software
 
Digital Chemical Representations
Digital Chemical RepresentationsDigital Chemical Representations
Digital Chemical RepresentationsNextMove Software
 
Challenges and successes in machine interpretation of Markush descriptions
Challenges and successes in machine interpretation of Markush descriptionsChallenges and successes in machine interpretation of Markush descriptions
Challenges and successes in machine interpretation of Markush descriptionsNextMove Software
 
PubChem as a Biologics Database
PubChem as a Biologics DatabasePubChem as a Biologics Database
PubChem as a Biologics DatabaseNextMove Software
 
CINF 17: Comparing Cahn-Ingold-Prelog Rule Implementations: The need for an o...
CINF 17: Comparing Cahn-Ingold-Prelog Rule Implementations: The need for an o...CINF 17: Comparing Cahn-Ingold-Prelog Rule Implementations: The need for an o...
CINF 17: Comparing Cahn-Ingold-Prelog Rule Implementations: The need for an o...NextMove Software
 
CINF 13: Pistachio - Search and Faceting of Large Reaction Databases
CINF 13: Pistachio - Search and Faceting of Large Reaction DatabasesCINF 13: Pistachio - Search and Faceting of Large Reaction Databases
CINF 13: Pistachio - Search and Faceting of Large Reaction DatabasesNextMove Software
 
Building on Sand: Standard InChIs on non-standard molfiles
Building on Sand: Standard InChIs on non-standard molfilesBuilding on Sand: Standard InChIs on non-standard molfiles
Building on Sand: Standard InChIs on non-standard molfilesNextMove Software
 
Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...
Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...
Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...NextMove Software
 
Advanced grammars for state-of-the-art named entity recognition (NER)
Advanced grammars for state-of-the-art named entity recognition (NER)Advanced grammars for state-of-the-art named entity recognition (NER)
Advanced grammars for state-of-the-art named entity recognition (NER)NextMove Software
 

More from NextMove Software (20)

DeepSMILES
DeepSMILESDeepSMILES
DeepSMILES
 
CINF 170: Regioselectivity: An application of expert systems and ontologies t...
CINF 170: Regioselectivity: An application of expert systems and ontologies t...CINF 170: Regioselectivity: An application of expert systems and ontologies t...
CINF 170: Regioselectivity: An application of expert systems and ontologies t...
 
Building a bridge between human-readable and machine-readable representations...
Building a bridge between human-readable and machine-readable representations...Building a bridge between human-readable and machine-readable representations...
Building a bridge between human-readable and machine-readable representations...
 
CINF 35: Structure searching for patent information: The need for speed
CINF 35: Structure searching for patent information: The need for speedCINF 35: Structure searching for patent information: The need for speed
CINF 35: Structure searching for patent information: The need for speed
 
A de facto standard or a free-for-all? A benchmark for reading SMILES
A de facto standard or a free-for-all? A benchmark for reading SMILESA de facto standard or a free-for-all? A benchmark for reading SMILES
A de facto standard or a free-for-all? A benchmark for reading SMILES
 
Recent Advances in Chemical & Biological Search Systems: Evolution vs Revolution
Recent Advances in Chemical & Biological Search Systems: Evolution vs RevolutionRecent Advances in Chemical & Biological Search Systems: Evolution vs Revolution
Recent Advances in Chemical & Biological Search Systems: Evolution vs Revolution
 
Can we agree on the structure represented by a SMILES string? A benchmark dat...
Can we agree on the structure represented by a SMILES string? A benchmark dat...Can we agree on the structure represented by a SMILES string? A benchmark dat...
Can we agree on the structure represented by a SMILES string? A benchmark dat...
 
Comparing Cahn-Ingold-Prelog Rule Implementations
Comparing Cahn-Ingold-Prelog Rule ImplementationsComparing Cahn-Ingold-Prelog Rule Implementations
Comparing Cahn-Ingold-Prelog Rule Implementations
 
Eugene Garfield: the father of chemical text mining and artificial intelligen...
Eugene Garfield: the father of chemical text mining and artificial intelligen...Eugene Garfield: the father of chemical text mining and artificial intelligen...
Eugene Garfield: the father of chemical text mining and artificial intelligen...
 
Chemical similarity using multi-terabyte graph databases: 68 billion nodes an...
Chemical similarity using multi-terabyte graph databases: 68 billion nodes an...Chemical similarity using multi-terabyte graph databases: 68 billion nodes an...
Chemical similarity using multi-terabyte graph databases: 68 billion nodes an...
 
Recent improvements to the RDKit
Recent improvements to the RDKitRecent improvements to the RDKit
Recent improvements to the RDKit
 
Pharmaceutical industry best practices in lessons learned: ELN implementation...
Pharmaceutical industry best practices in lessons learned: ELN implementation...Pharmaceutical industry best practices in lessons learned: ELN implementation...
Pharmaceutical industry best practices in lessons learned: ELN implementation...
 
Digital Chemical Representations
Digital Chemical RepresentationsDigital Chemical Representations
Digital Chemical Representations
 
Challenges and successes in machine interpretation of Markush descriptions
Challenges and successes in machine interpretation of Markush descriptionsChallenges and successes in machine interpretation of Markush descriptions
Challenges and successes in machine interpretation of Markush descriptions
 
PubChem as a Biologics Database
PubChem as a Biologics DatabasePubChem as a Biologics Database
PubChem as a Biologics Database
 
CINF 17: Comparing Cahn-Ingold-Prelog Rule Implementations: The need for an o...
CINF 17: Comparing Cahn-Ingold-Prelog Rule Implementations: The need for an o...CINF 17: Comparing Cahn-Ingold-Prelog Rule Implementations: The need for an o...
CINF 17: Comparing Cahn-Ingold-Prelog Rule Implementations: The need for an o...
 
CINF 13: Pistachio - Search and Faceting of Large Reaction Databases
CINF 13: Pistachio - Search and Faceting of Large Reaction DatabasesCINF 13: Pistachio - Search and Faceting of Large Reaction Databases
CINF 13: Pistachio - Search and Faceting of Large Reaction Databases
 
Building on Sand: Standard InChIs on non-standard molfiles
Building on Sand: Standard InChIs on non-standard molfilesBuilding on Sand: Standard InChIs on non-standard molfiles
Building on Sand: Standard InChIs on non-standard molfiles
 
Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...
Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...
Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...
 
Advanced grammars for state-of-the-art named entity recognition (NER)
Advanced grammars for state-of-the-art named entity recognition (NER)Advanced grammars for state-of-the-art named entity recognition (NER)
Advanced grammars for state-of-the-art named entity recognition (NER)
 

Recently uploaded

Patterns for automating API delivery. API conference
Patterns for automating API delivery. API conferencePatterns for automating API delivery. API conference
Patterns for automating API delivery. API conferencessuser9e7c64
 
SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?Alexandre Beguel
 
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...OnePlan Solutions
 
Osi security architecture in network.pptx
Osi security architecture in network.pptxOsi security architecture in network.pptx
Osi security architecture in network.pptxVinzoCenzo
 
Effectively Troubleshoot 9 Types of OutOfMemoryError
Effectively Troubleshoot 9 Types of OutOfMemoryErrorEffectively Troubleshoot 9 Types of OutOfMemoryError
Effectively Troubleshoot 9 Types of OutOfMemoryErrorTier1 app
 
Amazon Bedrock in Action - presentation of the Bedrock's capabilities
Amazon Bedrock in Action - presentation of the Bedrock's capabilitiesAmazon Bedrock in Action - presentation of the Bedrock's capabilities
Amazon Bedrock in Action - presentation of the Bedrock's capabilitiesKrzysztofKkol1
 
How to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationHow to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationBradBedford3
 
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdfEnhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdfRTS corp
 
Ronisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited CatalogueRonisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited Catalogueitservices996
 
SpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at RuntimeSpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at Runtimeandrehoraa
 
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptxReal-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptxRTS corp
 
SoftTeco - Software Development Company Profile
SoftTeco - Software Development Company ProfileSoftTeco - Software Development Company Profile
SoftTeco - Software Development Company Profileakrivarotava
 
Best Angular 17 Classroom & Online training - Naresh IT
Best Angular 17 Classroom & Online training - Naresh ITBest Angular 17 Classroom & Online training - Naresh IT
Best Angular 17 Classroom & Online training - Naresh ITmanoharjgpsolutions
 
Odoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 EnterpriseOdoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 Enterprisepreethippts
 
Leveraging AI for Mobile App Testing on Real Devices | Applitools + Kobiton
Leveraging AI for Mobile App Testing on Real Devices | Applitools + KobitonLeveraging AI for Mobile App Testing on Real Devices | Applitools + Kobiton
Leveraging AI for Mobile App Testing on Real Devices | Applitools + KobitonApplitools
 
Machine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their EngineeringMachine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their EngineeringHironori Washizaki
 
VictoriaMetrics Q1 Meet Up '24 - Community & News Update
VictoriaMetrics Q1 Meet Up '24 - Community & News UpdateVictoriaMetrics Q1 Meet Up '24 - Community & News Update
VictoriaMetrics Q1 Meet Up '24 - Community & News UpdateVictoriaMetrics
 
A healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfA healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfMarharyta Nedzelska
 
Not a Kubernetes fan? The state of PaaS in 2024
Not a Kubernetes fan? The state of PaaS in 2024Not a Kubernetes fan? The state of PaaS in 2024
Not a Kubernetes fan? The state of PaaS in 2024Anthony Dahanne
 
Salesforce Implementation Services PPT By ABSYZ
Salesforce Implementation Services PPT By ABSYZSalesforce Implementation Services PPT By ABSYZ
Salesforce Implementation Services PPT By ABSYZABSYZ Inc
 

Recently uploaded (20)

Patterns for automating API delivery. API conference
Patterns for automating API delivery. API conferencePatterns for automating API delivery. API conference
Patterns for automating API delivery. API conference
 
SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?
 
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
 
Osi security architecture in network.pptx
Osi security architecture in network.pptxOsi security architecture in network.pptx
Osi security architecture in network.pptx
 
Effectively Troubleshoot 9 Types of OutOfMemoryError
Effectively Troubleshoot 9 Types of OutOfMemoryErrorEffectively Troubleshoot 9 Types of OutOfMemoryError
Effectively Troubleshoot 9 Types of OutOfMemoryError
 
Amazon Bedrock in Action - presentation of the Bedrock's capabilities
Amazon Bedrock in Action - presentation of the Bedrock's capabilitiesAmazon Bedrock in Action - presentation of the Bedrock's capabilities
Amazon Bedrock in Action - presentation of the Bedrock's capabilities
 
How to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationHow to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion Application
 
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdfEnhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
 
Ronisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited CatalogueRonisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited Catalogue
 
SpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at RuntimeSpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at Runtime
 
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptxReal-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
 
SoftTeco - Software Development Company Profile
SoftTeco - Software Development Company ProfileSoftTeco - Software Development Company Profile
SoftTeco - Software Development Company Profile
 
Best Angular 17 Classroom & Online training - Naresh IT
Best Angular 17 Classroom & Online training - Naresh ITBest Angular 17 Classroom & Online training - Naresh IT
Best Angular 17 Classroom & Online training - Naresh IT
 
Odoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 EnterpriseOdoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 Enterprise
 
Leveraging AI for Mobile App Testing on Real Devices | Applitools + Kobiton
Leveraging AI for Mobile App Testing on Real Devices | Applitools + KobitonLeveraging AI for Mobile App Testing on Real Devices | Applitools + Kobiton
Leveraging AI for Mobile App Testing on Real Devices | Applitools + Kobiton
 
Machine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their EngineeringMachine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their Engineering
 
VictoriaMetrics Q1 Meet Up '24 - Community & News Update
VictoriaMetrics Q1 Meet Up '24 - Community & News UpdateVictoriaMetrics Q1 Meet Up '24 - Community & News Update
VictoriaMetrics Q1 Meet Up '24 - Community & News Update
 
A healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfA healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdf
 
Not a Kubernetes fan? The state of PaaS in 2024
Not a Kubernetes fan? The state of PaaS in 2024Not a Kubernetes fan? The state of PaaS in 2024
Not a Kubernetes fan? The state of PaaS in 2024
 
Salesforce Implementation Services PPT By ABSYZ
Salesforce Implementation Services PPT By ABSYZSalesforce Implementation Services PPT By ABSYZ
Salesforce Implementation Services PPT By ABSYZ
 

Sketchy Sketches: Hiding Chemistry in Plain Sight

  • 1. US 2015/342913 C00015 This WorkSCHEMBL17244875 / CID 118501484 V denotes cycloalkyl, heterocyclyl, aryl or heteroaryl, W denotes , heterocyclyl or heteroaryl; Generic features are recognised and captured in the output format. This allows a large scale summary of the features over U.S. Patents since 2001. Using the first sketch from the claims of each patent, 200,900 generic molecules with high/medium confidence can be extracted. Their feature distribution is: John May, Daniel Lowe and Roger Sayle NextMove Software Ltd, Cambridge, UK. NextMove Software Limited Innovation Centre (Unit 23) Cambridge Science Park Milton Road, Cambridge UK CB4 0EY www.nextmovesoftware.com Introduction We would like to thank MineSoft Ltd for funding and George Papadatos for providing bulk downloads of structures from SureChEMBL for the Dec 2015 Patent Applications. Sketchy Sketches: Hiding Chemistry in Plain Sight Sketch formats, such as ChemDraw, are often preferred for publication as they provide more freedom to the author. This freedom comes at a cost and what is displayed to the user may not coincide with the stored chemical representation. This can lead to incorrect structures if naïvely exported. More than 24 million ChemDraw files are available from the U.S. Patents, using these files a high fidelity converter/interpreter has been developed that is able to extract and associate more chemistry at higher precision. Content Categorisation Acknowledgements Evaluation Examples of Improved Interpretation Using 2,528[3] U.S. Patent Applications published in Dec 2015 the structures extracted by different methods are compared: A categorisation scheme has been developed to organise and characterise diverse sketch content. Sketches are assigned a content type (1), level of detail (2), and a confidence that the interpretation is correct (3): Substituent is assigned if the entity contains an attachment point and are commonly found in Markush claims. NoConnectionTable is assigned to non-chemical sketches and when an extracted connection table is not believed to be real (e.g. chessboardanes[1]). 1. https://cdsouthan.blogspot.co.uk/2015/07/chessboardane-and-other-strange-patent.html 2. Heifets, A and Jurisica, I. SCRIPDB: a portal for easy access to syntheses, chemicals and reactions in patents. Nucl. Acids. Res. 2012 3. Papadatos, G et al. SureChEMBL: a large-scale, chemically annotated patent document database. Nucl. Acids. Res. 2015 4. All U.S. patents applications in Dec 2015 with chemical complex work units. Generic is assigned when substituent, positional, or frequency variation is present. This primarily includes Markush cores but also polymers. Specific is assigned when all atom labels have been resolved. An entity with unresolved labels and no variation are Unknown, the labels are usually partial, ambiguous, or incorrect condensed formula. US 2015/370940 C00005 NoConnectionTable US 2015/0344503 C00668 Molecule/Specific/High US 2007/129372 C00001 Molecule/Generic/High US 2015/366987 C00008 Substituent/Specific/High Substituent/Generic/High US 2015/368236 C00019US 2015/378257 C00118 Molecule/Generic/High US 2015/378260 C00003 Molecule/Unknown US 2015/366987 C00008 SCHEMBL6726025 / CID 69879541 This Work Attachment Points and Punctuation Striping. Attachment points may be misrepresented as tert-butyl, iso-propyl, or methyl groups. Punctuation can be used to separate structure lists, if appended to an atom label that label may not be understood. Condensed Formula. Formulas are reinterpreted with a more accurate algorithm. Some labels previously defined will be unrecognised (Molecule/Unknown) whilst others are corrected. This WorkUS 2015/376182 C00030 Substituent Labels. Markush definitions are recognised in the patent text and used to resolve ambiguities. W for Tungsten? V for Vanadium? US 2015/376182 [Claim 1] Document Bibliography This WorkUS 2004/101442 C00025 Molfile Contents Generic Feature Survey SCRIPDB[2] and SureChEMBL[3] include structures from directly converting bundled Molfiles. SureChEMBL also includes structures obtained from raster images using image- to-structure. The raw data is used in this evaluation, SureChEMBL filters structures before indexing in the public interface. Normalisation. Output from CDX and TXT are split into individual components (e.g. splitting reactions and salts) to make it comparable to IMG and MOL which had already been split. Only components with ≥12 bonds between heavy atoms are considered. File Content. The 2,528 patents contain 136,372 ChemDraw files with the categories: Files can contain multiple entities, the numbers reported here are by file rather than entity. By entity Molecule/Specific is 106,402, Reaction/Specific is 16,388. Specific 72,520 Generic 26,743 Unknown 2,045 Specific 9,674 Generic 4,774 Unknown 698 Specific 11,647 Generic 6,653 Unknown 344 Molecule Reaction Substituent Frequency Variation Positional Variation Substituent Variation 0 20000 40000 60000 80000 100000 120000 140000 160000 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 Number of Features NumPatents Overlap. Structures are compared using Canonical SMILES (OEChem) of the individual components on a per document basis. For CDX only content categorised as specific molecules and reactions is counted. Whole Document Claims Section Differences. The biggest contribution to differences between the CDX, IMG and MOL are misinterpreted substituents and condensed formulae (see. left panel). Some differences are a result of human input error. US 2015/380663 C00381 Conclusion Interpreting ChemDraw files shows closer agreement with chemistry mined from the patent text than image-to-structure and much closer agreement than Molfile conversion. Reading the ChemDraw files, rather than the derived Molfiles, allows extraction of complex concepts such as reaction schemes and generic structures. The categorisation helps identify problematic structures and highlights areas to be improved. A major advantage of the ChemDraw interpretation is speed, taking 5h on a single-core to process the back catalogue of 24+ million. SCHEMBL17366912 / CID 118602343 This Work Unique Component-Document Associations Detail: Specific|Generic|Unknown Type: Molecule|Reaction|Substituent|NoConnectionTable Confidence: High|Medium|Low 1 2 3 Not Found in TXT Also Found in TXT CDX 49,119 36,829 (42.8%) IMG 49,836 35,545 (41.6%) MOL 58,169 28,926 (33.2%) Not Found in TXT Also Found in TXT CDX 10,664 800 (6.9%) MOL 11,078 796 (6.7%) IMG 11,783 689 (5.5%) CDX IMG MOL TXT 0 25000 50000 75000 100000 125000 150000 175000 200000 225000 250000 Count (Whole Document) 0 2500 5000 7500 10000 12500 15000 17500 20000 22500 25000 Count (Claims Section) Generic Mol/Rxn Specific Mol/Rxn Substituent Molecule 101,308 Reaction 15,146 Substituent 18,644 NoConnectionTable 1,274 PATENT XML CDX MOL TIF Exported by ChemDraw Name-to-Structure (NextMove's LeadMine) Image-to-Structure (SureChEMBL,CLiDE) Molfile Reader (SCRIPDB,SureChEMBL) ChemDraw Interpreter (This Work) LeadMine CDX IMG MOL TXT Patent and Data Bundle Available from USPTO