The EXTRACTTEXT workflow application extracts text content from an input file (.pdf, .docx, .txt, .xml, .htm, .html, or .md; see note below on file format support) and returns the extracted text and its length. It supports optional parameters for maximum file size, trimming, and text normalization (Unix-style line breaks).
Notes:
- EXTRACTTEXT is available as of WorkflowGen version 10.0.0 (v10 Preview 1).
- Support for .xml, .htm, .html, and .md files is available as of WorkflowGen version 10.2.0.
Required parameters
| Parameter | Type | Direction | Description |
|---|---|---|---|
| FILE | FILE | IN | The file from which to extract the text (must be .pdf, .docx, .txt, .xml, .htm, .html, or .md) |
| TEXT | TEXT | OUT | The extracted (and possibly normalized/trimmed) text |
| LENGTH | NUMERIC | OUT | The length (number of characters) of the extracted text |
Optional parameters
| Parameter | Type | Direction | Description |
|---|---|---|---|
| MAX_FILE_SIZE | NUMERIC | IN | Maximum allowed file size in MB |
| TRIM_SIZE | NUMERIC | IN | Maximum number of characters to keep from the extracted text |
| NORMALIZE | TEXT | IN | Whether to normalize line endings. Possible values: Y, N, true, false |