DocumentExtractor (Marvin API documentation (c) 1998-2012 ChemAxon Ltd.)

chemaxon.naming
Class DocumentExtractor

java.lang.Object
  chemaxon.naming.DocumentExtractor

public class DocumentExtractor
extends java.lang.Object

Extracts chemical names from text documents and converts them to chemical structures. Example usage:

// We have a document to process
java.io.Reader document = ...;

DocumentExtractor x = new DocumentExtractor();
x.processHTML(document); // or processPlainText(document) for input in plain text format

// Iterate through the hits (the for-each loop is a Java 1.5 feature; otherwise use a java.util.Iterator)
for (DocumentExtractor.Hit hit : x.getHits()) {
  System.out.println(hit.position + ": " + hit.text + ": " + hit.structure.toFormat("smiles"));
}

The field hit.position contains the position of the first character of the name in the document.

Note that hit.text contains the name as it appears in the source document. A cleaned version (of possible OCR errors, typos, ...) can be retrieved with hit.structure.getName().
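The relation between hit.position and hit.text can be sketched as follows. This is a minimal illustration that assumes hit.position is a 0-based character offset into the same text that was processed; that assumption is not confirmed above, in particular for HTML input. The class name, the helper contextAround and the sample values are hypothetical, and plain values stand in for the Hit fields so that the sketch compiles without the Marvin library.

```java
public class HitContextSketch {

    /**
     * Returns the name plus up to 'margin' characters on each side,
     * clamped to the document bounds.
     * Assumes 0 <= position <= document.length(), length >= 0, margin >= 0.
     */
    public static String contextAround(String document, int position, int length, int margin) {
        int start = Math.max(0, position - margin);
        int end = Math.min(document.length(), position + length + margin);
        return document.substring(start, end);
    }

    public static void main(String[] args) {
        String document = "Take aspirin daily";
        // Hypothetical values standing in for hit.position and hit.text
        int position = 5;
        String text = "aspirin";
        System.out.println(contextAround(document, position, text.length(), 3));
    }
}
```

Showing a few characters of context around each hit can make it easier to review names that were cleaned of OCR errors or typos.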

This class can also be run from the command line. It then expects the name of a plain text file as the first argument; when no argument is given, the text is read from the standard input. The list of hits is printed on the standard output.

Author:: Daniel Bonniot

Nested Class Summary
`class`	`DocumentExtractor.Hit` An occurrence of a chemical name in the processed document.
`static class`	`DocumentExtractor.ProgressInfo`
`static interface`	`DocumentExtractor.ProgressListener`

Field Summary
`static java.lang.String`	`propertyPage`
`static java.lang.String`	`propertySourceDocument`

Constructor Summary
`DocumentExtractor()` Creates a new document extractor.
`DocumentExtractor(java.io.File document)` If the file name ends with ".gz", the content will be uncompressed automatically.
`DocumentExtractor(java.io.File document, java.lang.String encoding)` If the file name ends with ".gz", the content will be uncompressed automatically.
`DocumentExtractor(java.io.Reader r)`
`DocumentExtractor(java.lang.String text)` Extract structures from a String.
`DocumentExtractor(java.net.URL document)`
`DocumentExtractor(java.net.URLConnection document)` Create a document extractor for the given URL connection.

Method Summary
`void`	`acceptElements(boolean on)`
`void`	`acceptGenericNames(boolean on)` Whether to accept generic, frequent names like "water".
`void`	`acceptIons(boolean on)`
`void`	`clearHits()` Clears the list of hits.
`java.util.List<DocumentExtractor.Hit>`	`getHits()` Returns the hits found in the documents processed so far.
`static void`	`main(java.lang.String[] args)` Expects the name of a plain text file as the first argument; when no argument is given, the text is read from the standard input.
`static void`	`printEncodingError()`
`void`	`processHTML()` Extract names from an HTML document.
`void`	`processHTML(DocumentExtractor.ProgressListener progressListener)` Extract names from an HTML document.
`void`	`processHTML(java.io.Reader r)` Extract names from an HTML document.
`void`	`processPlainText()` Extract names from a plain text document.
`void`	`processPlainText(DocumentExtractor.ProgressListener progressListener)` Extract names from a plain text document.
`void`	`processPlainText(java.io.Reader r)` Extract names from a plain text document.
`static DocumentExtractor`	`readPDF(java.io.File pdf)` Creates a DocumentExtractor to process the given PDF document.
`static DocumentExtractor`	`readPDF(java.io.InputStream pdfStream)` Creates a DocumentExtractor to process the given PDF document.
`void`	`setCasNumberLookup(boolean value)` Enable or disable the lookup of CAS numbers (requires network access).

Methods inherited from class java.lang.Object
`clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait`

Field Detail

propertySourceDocument

public static final java.lang.String propertySourceDocument

See Also:: Constant Field Values

propertyPage

public static final java.lang.String propertyPage

See Also:: Constant Field Values

Constructor Detail

DocumentExtractor

public DocumentExtractor()

Creates a new document extractor.

DocumentExtractor

public DocumentExtractor(java.io.File document)
                  throws java.io.IOException

If the file name ends with ".gz", the content will be uncompressed automatically. Note that the value returned by getEstimatedTotalCharacters() will be incorrect in this case (correct handling is not implemented yet; request it if needed).

Throws:: java.io.IOException

DocumentExtractor

public DocumentExtractor(java.io.File document,
                         java.lang.String encoding)
                  throws java.io.IOException

If the file name ends with ".gz", the content will be uncompressed automatically. Note that the value returned by getEstimatedTotalCharacters() will be incorrect in this case (correct handling is not implemented yet; request it if needed).

Throws:: java.io.IOException
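A gzip-compressed input for the two File constructors could be prepared as in the sketch below. It uses only the JDK; the class and method names are hypothetical. The hand-off to DocumentExtractor is left as a comment so that the sketch compiles without the Marvin library, and passing "UTF-8" as the encoding argument is an assumption about the accepted values.

```java
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPOutputStream;

public class GzipInputSketch {

    /** Writes the text as UTF-8 into a gzip-compressed file and returns its path. */
    public static Path writeGzippedText(Path target, String text) throws IOException {
        try (Writer w = new OutputStreamWriter(
                new GZIPOutputStream(Files.newOutputStream(target)), StandardCharsets.UTF_8)) {
            w.write(text);
        }
        return target;
    }

    public static void main(String[] args) throws IOException {
        // The ".gz" suffix is what triggers automatic decompression
        Path gz = Files.createTempFile("names", ".txt.gz");
        writeGzippedText(gz, "Aspirin is also known as acetylsalicylic acid.");

        // With Marvin on the classpath, the file could then be handed over as
        // (encoding value is an assumption):
        //   DocumentExtractor x = new DocumentExtractor(gz.toFile(), "UTF-8");
        //   x.processPlainText();
        System.out.println(gz.getFileName() + ": " + Files.size(gz) + " bytes");
    }
}
```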

DocumentExtractor

public DocumentExtractor(java.net.URL document)
                  throws java.io.IOException

Throws:: java.io.IOException

DocumentExtractor

public DocumentExtractor(java.net.URLConnection document)
                  throws java.io.IOException

Create a document extractor for the given URL connection. This constructor is also useful when using a java.net.Proxy class, by using the URL.openConnection(java.net.Proxy) method to obtain the URLConnection.

Throws:: java.io.IOException
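Obtaining a URLConnection through a proxy could look like the sketch below. It relies only on the JDK classes java.net.Proxy and java.net.URL; the class name, the helper openViaProxy, and the proxy host and port are hypothetical. Opening the connection object does not contact the network yet. The DocumentExtractor call is left as a comment so that the sketch compiles without the Marvin library.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.URI;
import java.net.URLConnection;

public class ProxyConnectionSketch {

    /** Creates (but does not connect) a URLConnection routed through an HTTP proxy. */
    public static URLConnection openViaProxy(String url, String proxyHost, int proxyPort)
            throws IOException {
        // createUnresolved avoids a DNS lookup at this point
        Proxy proxy = new Proxy(Proxy.Type.HTTP,
                InetSocketAddress.createUnresolved(proxyHost, proxyPort));
        return URI.create(url).toURL().openConnection(proxy);
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical proxy host and port
        URLConnection conn = openViaProxy("http://example.com/page.html", "proxy.example.org", 8080);

        // With Marvin on the classpath, the connection could then be handed over as:
        //   DocumentExtractor x = new DocumentExtractor(conn);
        //   x.processHTML();
        System.out.println("Prepared connection to " + conn.getURL());
    }
}
```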

DocumentExtractor

public DocumentExtractor(java.io.Reader r)

DocumentExtractor

public DocumentExtractor(java.lang.String text)

Extract structures from a String.

Since:: 5.8

Method Detail

setCasNumberLookup

public void setCasNumberLookup(boolean value)

Enable or disable the lookup of CAS numbers (requires network access). Disabled by default.

acceptElements

public void acceptElements(boolean on)

acceptIons

public void acceptIons(boolean on)

acceptGenericNames

public void acceptGenericNames(boolean on)

Whether to accept generic, frequent names like "water".

main

public static void main(java.lang.String[] args)

Expects the name of a plain text file as the first argument; when no argument is given, the text is read from the standard input. The list of hits is printed on the standard output.

printEncodingError

public static void printEncodingError()

processPlainText

public void processPlainText(java.io.Reader r)
                      throws java.io.IOException

Extract names from a plain text document. Buffering is done internally, so passing a BufferedReader is not necessary.

Throws:: java.io.IOException
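Since buffering is done internally, a plain InputStreamReader with an explicit character set is enough as input. The sketch below shows one way to open such a reader using only the JDK; the class name and the helper openUtf8 are hypothetical, and UTF-8 is an assumed encoding for the sample file. The processPlainText call is left as a comment so that the sketch compiles without the Marvin library.

```java
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class PlainReaderSketch {

    /** Opens an unbuffered reader that decodes the file as UTF-8. */
    public static Reader openUtf8(Path file) throws IOException {
        return new InputStreamReader(Files.newInputStream(file), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("doc", ".txt");
        Files.write(file, "Ethanol and benzene were mixed.".getBytes(StandardCharsets.UTF_8));

        try (Reader reader = openUtf8(file)) {
            // With Marvin on the classpath, the reader could be passed on directly:
            //   DocumentExtractor x = new DocumentExtractor();
            //   x.processPlainText(reader);
            System.out.println("Reader ready: " + reader.ready());
        }
    }
}
```

Stating the character set explicitly avoids depending on the platform default encoding.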

processPlainText

public void processPlainText()
                      throws java.io.IOException

Extract names from a plain text document.

Throws:: java.io.IOException

processPlainText

public void processPlainText(DocumentExtractor.ProgressListener progressListener)
                      throws java.io.IOException

Extract names from a plain text document. Buffering is done internally, so passing a BufferedReader is not necessary.

Throws:: java.io.IOException

processHTML

public void processHTML(java.io.Reader r)
                 throws java.io.IOException

Extract names from an HTML document. Buffering is done internally, so passing a BufferedReader is not necessary.

Throws:: java.io.IOException

processHTML

public void processHTML()
                 throws java.io.IOException

Extract names from an HTML document. Buffering is done internally, so passing a BufferedReader is not necessary.

Throws:: java.io.IOException

processHTML

public void processHTML(DocumentExtractor.ProgressListener progressListener)
                 throws java.io.IOException

Extract names from an HTML document. Buffering is done internally, so passing a BufferedReader is not necessary.

Throws:: java.io.IOException

getHits

public java.util.List<DocumentExtractor.Hit> getHits()

Returns the hits found in the documents processed so far.

clearHits

public void clearHits()

Clears the list of hits.

readPDF

public static DocumentExtractor readPDF(java.io.File pdf)
                                 throws java.io.IOException

Creates a DocumentExtractor to process the given PDF document.

Throws:: java.io.IOException

readPDF

public static DocumentExtractor readPDF(java.io.InputStream pdfStream)
                                 throws java.io.IOException

Creates a DocumentExtractor to process the given PDF document.

Throws:: java.io.IOException
