chemaxon.naming
Class DocumentExtractor

java.lang.Object
  extended by chemaxon.naming.DocumentExtractor

public class DocumentExtractor
extends java.lang.Object

Extracts chemical names from text documents and converts them to chemical structures. Example usage:

// We have a document to process
java.io.Reader document = ...;

DocumentExtractor x = new DocumentExtractor();
x.processHTML(document); // or processPlainText(document) for input in plain text format

// Iterate through the hits (the for-each loop requires Java 1.5 or later; otherwise use a java.util.Iterator)
for (Hit hit : x.getHits()) {
  System.out.println(hit.position + ": " + hit.text + ": " + hit.structure.toFormat("smiles"));
}

The field hit.position contains the position of the first character of the name in the document.

Note that hit.text contains the name exactly as it appears in the source document. A cleaned version, with possible OCR errors, typos and similar defects corrected, can be retrieved with hit.structure.getName().
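
Since hit.position marks where the name starts and hit.text is the name as written in the source, the two together delimit the matched span. The following is a minimal sketch of a hypothetical helper (HitContext.snippet is not part of this class) that cuts a snippet of surrounding context out of the original text. It assumes a plain text document still held as a String and a zero-based character offset; neither assumption is confirmed by this page.

```java
public class HitContext {
    /**
     * Returns the matched name with up to `margin` characters of context on each side.
     * `position` and `text` stand in for hit.position and hit.text
     * (assumption: position is a zero-based character offset into `document`).
     */
    public static String snippet(String document, int position, String text, int margin) {
        // Clamp both ends so the snippet never runs past the document boundaries.
        int start = Math.max(0, position - margin);
        int end = Math.min(document.length(), position + text.length() + margin);
        return document.substring(start, end);
    }
}
```

Inside the loop over x.getHits(), such a helper could be called as HitContext.snippet(documentText, hit.position, hit.text, 20), where documentText is the caller's own copy of the processed text.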

This class can also be run from the command line. It then expects the name of a plain text file as its first argument; when no argument is given, it reads the document from the standard input. The list of hits is printed to the standard output.

Author:
Daniel Bonniot

Nested Class Summary
 class DocumentExtractor.Hit
          An occurrence of a chemical name in the processed document.
static class DocumentExtractor.ProgressInfo
           
static interface DocumentExtractor.ProgressListener
           
 
Field Summary
static java.lang.String propertyPage
           
static java.lang.String propertySourceDocument
           
 
Constructor Summary
DocumentExtractor()
          Creates a new document extractor.
DocumentExtractor(java.io.File document)
          If the file name ends with ".gz", the content will be uncompressed automatically.
DocumentExtractor(java.io.File document, java.lang.String encoding)
          If the file name ends with ".gz", the content will be uncompressed automatically.
DocumentExtractor(java.io.Reader r)
           
DocumentExtractor(java.lang.String text)
          Extract structures from a String.
DocumentExtractor(java.net.URL document)
           
DocumentExtractor(java.net.URLConnection document)
          Creates a document extractor for the given URL connection.
 
Method Summary
 void acceptElements(boolean on)
           
 void acceptGenericNames(boolean on)
          Sets whether generic, frequently occurring names such as "water" are accepted.
 void acceptIons(boolean on)
           
 void clearHits()
          Clears the list of hits.
 java.util.List<DocumentExtractor.Hit> getHits()
          Returns the hits found in the documents processed so far.
static void main(java.lang.String[] args)
          Expects the name of a plain text file as the first argument; reads from the standard input when no argument is given.
static void printEncodingError()
           
 void processHTML()
          Extract names from an HTML document.
 void processHTML(DocumentExtractor.ProgressListener progressListener)
          Extract names from an HTML document.
 void processHTML(java.io.Reader r)
          Extract names from an HTML document.
 void processPlainText()
          Extract names from a plain text document.
 void processPlainText(DocumentExtractor.ProgressListener progressListener)
          Extract names from a plain text document.
 void processPlainText(java.io.Reader r)
          Extract names from a plain text document.
static DocumentExtractor readPDF(java.io.File pdf)
          Creates a DocumentExtractor to process the given PDF document.
static DocumentExtractor readPDF(java.io.InputStream pdfStream)
          Creates a DocumentExtractor to process the given PDF document.
 void setCasNumberLookup(boolean value)
          Enable or disable the lookup of CAS numbers (requires network access).
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

propertySourceDocument

public static final java.lang.String propertySourceDocument
See Also:
Constant Field Values

propertyPage

public static final java.lang.String propertyPage
See Also:
Constant Field Values
Constructor Detail

DocumentExtractor

public DocumentExtractor()
Creates a new document extractor.


DocumentExtractor

public DocumentExtractor(java.io.File document)
                  throws java.io.IOException
If the file name ends with ".gz", the content will be uncompressed automatically. However, the value reported by getEstimatedTotalCharacters() will be wrong in this case (correct handling is not yet implemented; request it if needed).

Throws:
java.io.IOException

DocumentExtractor

public DocumentExtractor(java.io.File document,
                         java.lang.String encoding)
                  throws java.io.IOException
If the file name ends with ".gz", the content will be uncompressed automatically. However, the value reported by getEstimatedTotalCharacters() will be wrong in this case (correct handling is not yet implemented; request it if needed).

Throws:
java.io.IOException
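
The automatic ".gz" handling described for the two File constructors can be pictured as wrapping the file stream in a java.util.zip.GZIPInputStream before decoding the characters. The sketch below illustrates that idea with standard Java only. It is an assumption about the behavior, not the library's actual implementation, and GzipAwareReader.open is a hypothetical name.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.util.zip.GZIPInputStream;

public class GzipAwareReader {
    /** Opens `file` as a Reader, decompressing on the fly when the name ends with ".gz". */
    public static Reader open(File file, String encoding) throws IOException {
        InputStream in = new FileInputStream(file);
        try {
            if (file.getName().endsWith(".gz")) {
                in = new GZIPInputStream(in);
            }
            // Decode bytes to characters with the caller-supplied encoding.
            return new InputStreamReader(in, encoding);
        } catch (IOException e) {
            in.close(); // do not leak the file handle if wrapping fails
            throw e;
        }
    }
}
```

A Reader obtained this way could be passed to DocumentExtractor(java.io.Reader) or processPlainText(java.io.Reader) when explicit control over decompression is wanted.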

DocumentExtractor

public DocumentExtractor(java.net.URL document)
                  throws java.io.IOException
Throws:
java.io.IOException

DocumentExtractor

public DocumentExtractor(java.net.URLConnection document)
                  throws java.io.IOException
Creates a document extractor for the given URL connection. This constructor is also useful for going through a proxy: pass a java.net.Proxy to the URL.openConnection(java.net.Proxy) method to obtain the URLConnection.

Throws:
java.io.IOException
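
The proxy scenario mentioned for the URLConnection constructor can be sketched with standard java.net calls. ProxiedConnection.via is a hypothetical helper, and the host names and port in the usage note are placeholders. The connection is only prepared here; no network traffic occurs until it is actually read.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.URI;
import java.net.URLConnection;

public class ProxiedConnection {
    /** Prepares (but does not yet open) a connection to `address` routed through an HTTP proxy. */
    public static URLConnection via(String address, String proxyHost, int proxyPort) throws IOException {
        // createUnresolved avoids a DNS lookup at construction time.
        Proxy proxy = new Proxy(Proxy.Type.HTTP,
                InetSocketAddress.createUnresolved(proxyHost, proxyPort));
        return URI.create(address).toURL().openConnection(proxy);
    }
}
```

The returned object could then be handed to the constructor above, for example new DocumentExtractor(ProxiedConnection.via("http://example.com/paper.html", "proxy.example.org", 8080)).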

DocumentExtractor

public DocumentExtractor(java.io.Reader r)

DocumentExtractor

public DocumentExtractor(java.lang.String text)
Extract structures from a String.

Since:
5.8
Method Detail

setCasNumberLookup

public void setCasNumberLookup(boolean value)
Enable or disable the lookup of CAS numbers (requires network access). Disabled by default.


acceptElements

public void acceptElements(boolean on)

acceptIons

public void acceptIons(boolean on)

acceptGenericNames

public void acceptGenericNames(boolean on)
Sets whether generic, frequently occurring names such as "water" are accepted.


main

public static void main(java.lang.String[] args)
Expects the name of a plain text file as the first argument; when no argument is given, the document is read from the standard input. The list of hits is printed to the standard output.


printEncodingError

public static void printEncodingError()

processPlainText

public void processPlainText(java.io.Reader r)
                      throws java.io.IOException
Extract names from a plain text document. Buffering is done internally, so passing a BufferedReader is not necessary.

Throws:
java.io.IOException

processPlainText

public void processPlainText()
                      throws java.io.IOException
Extract names from a plain text document.

Throws:
java.io.IOException

processPlainText

public void processPlainText(DocumentExtractor.ProgressListener progressListener)
                      throws java.io.IOException
Extract names from a plain text document. Buffering is done internally, so passing a BufferedReader is not necessary.

Throws:
java.io.IOException

processHTML

public void processHTML(java.io.Reader r)
                 throws java.io.IOException
Extract names from an HTML document. Buffering is done internally, so passing a BufferedReader is not necessary.

Throws:
java.io.IOException

processHTML

public void processHTML()
                 throws java.io.IOException
Extract names from an HTML document. Buffering is done internally, so passing a BufferedReader is not necessary.

Throws:
java.io.IOException

processHTML

public void processHTML(DocumentExtractor.ProgressListener progressListener)
                 throws java.io.IOException
Extract names from an HTML document. Buffering is done internally, so passing a BufferedReader is not necessary.

Throws:
java.io.IOException

getHits

public java.util.List<DocumentExtractor.Hit> getHits()
Returns the hits found in the documents processed so far.


clearHits

public void clearHits()
Clears the list of hits.


readPDF

public static DocumentExtractor readPDF(java.io.File pdf)
                                 throws java.io.IOException
Creates a DocumentExtractor to process the given PDF document.

Throws:
java.io.IOException

readPDF

public static DocumentExtractor readPDF(java.io.InputStream pdfStream)
                                 throws java.io.IOException
Creates a DocumentExtractor to process the given PDF document.

Throws:
java.io.IOException