java.lang.Object
  chemaxon.naming.DocumentExtractor
public class DocumentExtractor
Extracts chemical names from text documents and converts them to chemical structures. Example usage:
    // We have a document to process
    java.io.Reader document = ...;
    DocumentExtractor x = new DocumentExtractor();
    x.processHTML(document); // or processPlainText(document) for input in plain text format

    // Iterate through the hits (using a Java 1.5 feature, otherwise use a java.util.Iterator)
    for (Hit hit : x.getHits()) {
        System.out.println(hit.position + ": " + hit.text + ": " + hit.structure.toFormat("smiles"));
    }

The field hit.position contains the position of the first character of the name in the document.
Note that hit.text contains the name as it appears in the source document. A version cleaned of possible OCR errors, typos, and the like can be retrieved with hit.structure.getName().
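The relationship between hit.position and hit.text can be illustrated with a small sketch that cuts a context window around a hit. The NameHit class below is a hypothetical stand-in for DocumentExtractor.Hit, reduced to the two fields it needs, and the sketch assumes that the position is a 0-based character offset into the same text that was processed; the documentation above does not state the base, and for HTML input the offset may not map directly onto the visible text.

```java
// Hypothetical stand-in for DocumentExtractor.Hit, reduced to the two
// fields used here; the real class is provided by the library.
final class NameHit {
    final int position;   // assumed: 0-based offset of the first character
    final String text;    // the name as it appears in the document

    NameHit(int position, String text) {
        this.position = position;
        this.text = text;
    }
}

final class HitContext {
    /**
     * Returns the hit text together with up to `margin` characters of
     * surrounding context on each side, clamped to the document bounds.
     */
    static String snippet(String document, NameHit hit, int margin) {
        int start = Math.max(0, hit.position - margin);
        int end = Math.min(document.length(),
                           hit.position + hit.text.length() + margin);
        return document.substring(start, end);
    }

    public static void main(String[] args) {
        String document = "The sample was dissolved in ethanol and heated.";
        NameHit hit = new NameHit(document.indexOf("ethanol"), "ethanol");
        // Print the name with ten characters of context on each side.
        System.out.println(snippet(document, hit, 10));
    }
}
```

With a margin of 0 the snippet is exactly the name as it appears in the document, which is a quick way to check that offsets and text line up.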
This class can also be run from the command line. It expects the name of a plain text file as the first argument, or reads from standard input when no argument is given. The list of hits is printed to standard output.
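A caller who wants the same "file argument or standard input" behavior in their own program could select the input along these lines. This is a sketch of the pattern only, not the library's implementation: the class name InputSelection and the UTF-8 encoding are assumptions made for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

final class InputSelection {
    /**
     * Opens the file named by the first argument, or falls back to the
     * given stream (typically System.in) when no argument is present.
     * UTF-8 is assumed here; the real tool may use a different encoding.
     */
    static Reader open(String[] args, InputStream stdin) {
        try {
            if (args.length > 0) {
                return Files.newBufferedReader(Paths.get(args[0]), StandardCharsets.UTF_8);
            }
            return new InputStreamReader(stdin, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the whole content of the reader into a String and closes it. */
    static String readAll(Reader reader) {
        try (Reader r = reader) {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[4096];
            int n;
            while ((n = r.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Echo the selected input; a real caller would pass the Reader on
        // to a method such as processPlainText(java.io.Reader) instead.
        System.out.print(readAll(open(args, System.in)));
    }
}
```

The Reader returned by open has the type accepted by processPlainText(java.io.Reader) and processHTML(java.io.Reader).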
Nested Class Summary | |
---|---|
class | DocumentExtractor.Hit: An occurrence of a chemical name in the processed document. |
static class | DocumentExtractor.ProgressInfo |
static interface | DocumentExtractor.ProgressListener |
Field Summary | |
---|---|
static java.lang.String | propertyPage |
static java.lang.String | propertySourceDocument |
Constructor Summary | |
---|---|
DocumentExtractor() | Creates a new document extractor. |
DocumentExtractor(java.io.File document) | If the file name ends with ".gz", the content will be uncompressed automatically. |
DocumentExtractor(java.io.File document, java.lang.String encoding) | If the file name ends with ".gz", the content will be uncompressed automatically. |
DocumentExtractor(java.io.Reader r) | |
DocumentExtractor(java.lang.String text) | Extract structures from a String. |
DocumentExtractor(java.net.URL document) | |
DocumentExtractor(java.net.URLConnection document) | Create a document extractor for the given URL connection. |
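When a document has to be opened by hand, for example to pass a Reader to the DocumentExtractor(java.io.Reader) constructor, the ".gz" handling described for the File constructors can be approximated with the JDK alone. The sketch below is an assumption about how such behavior could be reproduced, not the library's own code; the class GzipAwareReader and its methods are hypothetical names.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

final class GzipAwareReader {
    /** Opens the file as text, decompressing it when the name ends with ".gz". */
    static Reader open(File file, String encoding) {
        try {
            InputStream in = new FileInputStream(file);
            try {
                if (file.getName().endsWith(".gz")) {
                    in = new GZIPInputStream(in);
                }
                return new BufferedReader(new InputStreamReader(in, encoding));
            } catch (IOException e) {
                in.close(); // do not leak the file handle on failure
                throw e;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the whole (possibly compressed) file into a String. */
    static String readAll(File file, String encoding) {
        try (Reader r = open(file, encoding)) {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[4096];
            int n;
            while ((n = r.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Helper for trying the sketch out: writes text to a gzip-compressed file. */
    static void writeGzip(File file, String text, String encoding) {
        try (Writer w = new OutputStreamWriter(
                new GZIPOutputStream(new FileOutputStream(file)), encoding)) {
            w.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The check is on the file name only, matching the wording of the constructor descriptions; a file that is compressed but lacks the ".gz" suffix would be read as raw bytes.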
Method Summary | |
---|---|
void | acceptElements(boolean on) |
void | acceptGenericNames(boolean on): Whether to accept generic, frequent names like "water". |
void | acceptIons(boolean on) |
void | clearHits(): Clears the list of hits. |
java.util.List<DocumentExtractor.Hit> | getHits(): Returns the hits found in the documents processed so far. |
static void | main(java.lang.String[] args): Expects the name of a plain text file as the first argument (or reads from standard input when absent). |
static void | printEncodingError() |
void | processHTML(): Extract names from an HTML document. |
void | processHTML(DocumentExtractor.ProgressListener progressListener): Extract names from an HTML document. |
void | processHTML(java.io.Reader r): Extract names from an HTML document. |
void | processPlainText(): Extract names from a plain text document. |
void | processPlainText(DocumentExtractor.ProgressListener progressListener): Extract names from a plain text document. |
void | processPlainText(java.io.Reader r): Extract names from a plain text document. |
static DocumentExtractor | readPDF(java.io.File pdf): Creates a DocumentExtractor to process the given PDF document. |
static DocumentExtractor | readPDF(java.io.InputStream pdfStream): Creates a DocumentExtractor to process the given PDF document. |
void | setCasNumberLookup(boolean value): Enable or disable the lookup of CAS numbers (requires network access). |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
public static final java.lang.String propertySourceDocument
public static final java.lang.String propertyPage
Constructor Detail |
---|
public DocumentExtractor()
public DocumentExtractor(java.io.File document) throws java.io.IOException
Throws: java.io.IOException
public DocumentExtractor(java.io.File document, java.lang.String encoding) throws java.io.IOException
Throws: java.io.IOException
public DocumentExtractor(java.net.URL document) throws java.io.IOException
Throws: java.io.IOException
public DocumentExtractor(java.net.URLConnection document) throws java.io.IOException
Throws: java.io.IOException
public DocumentExtractor(java.io.Reader r)
public DocumentExtractor(java.lang.String text)
Method Detail |
---|
public void setCasNumberLookup(boolean value)
public void acceptElements(boolean on)
public void acceptIons(boolean on)
public void acceptGenericNames(boolean on)
public static void main(java.lang.String[] args)
public static void printEncodingError()
public void processPlainText(java.io.Reader r) throws java.io.IOException
Throws: java.io.IOException
public void processPlainText() throws java.io.IOException
Throws: java.io.IOException
public void processPlainText(DocumentExtractor.ProgressListener progressListener) throws java.io.IOException
Throws: java.io.IOException
public void processHTML(java.io.Reader r) throws java.io.IOException
Throws: java.io.IOException
public void processHTML() throws java.io.IOException
Throws: java.io.IOException
public void processHTML(DocumentExtractor.ProgressListener progressListener) throws java.io.IOException
Throws: java.io.IOException
public java.util.List<DocumentExtractor.Hit> getHits()
public void clearHits()
public static DocumentExtractor readPDF(java.io.File pdf) throws java.io.IOException
Throws: java.io.IOException
public static DocumentExtractor readPDF(java.io.InputStream pdfStream) throws java.io.IOException
Throws: java.io.IOException