Document Processing
Flow-Like provides document processing nodes that extract text from PDFs, process Excel files, batch-transform documents, and use AI for structured extraction.
Supported Document Types
| Format | Capabilities |
|---|---|
| PDF | Page count, render to images, text extraction |
| Excel (.xlsx) | Read/write cells, manage worksheets, extract tables |
| CSV | Stream reading, database conversion, SQL queries |
| Images | OCR, resize, crop, rotate, convert formats |
| Word (.docx) | Text extraction |
| HTML | Convert to Markdown, extract content |
PDF Processing
Get Page Count

    PDF Page Count (file_path)
        │
        ▼
    Number: 42

Render Pages as Images
Process each page visually:

    PDF To Images (file_path)
        │
        ▼
    Array<Image>: [page1.png, page2.png, ...]

Or render a specific page:

    PDF Page To Image
    ├── file: document.pdf
    ├── page: 1 (1-based)
    └── scale: 2.0 (for high resolution)
        │
        ▼
    Image (PNG)

Extract Text with AI
For complex PDFs with mixed layouts:

    AI Extract Document
    ├── file: complex_document.pdf
    ├── model: GPT-4 Vision
    └── extract_images: true
        │
        ▼
    Markdown text with structure preserved

What AI extraction handles:
- Multi-column layouts
- Tables and charts
- Handwritten text
- Mixed text and images
- Scanned documents
Example: Invoice Processing Pipeline

    Quick Action Event (pdf_files: Array<Path>)
        │
        ▼
    For Each pdf_file
        │
        ▼
    AI Extract Document
        │
        ▼
    Extract Knowledge (Invoice Schema)
    ├── vendor: String
    ├── invoice_number: String
    ├── date: Date
    ├── line_items: Array<{description, quantity, price}>
    └── total: Number
        │
        ▼
    Insert to Database ──▶ Return summary

Excel Processing
Read/Write Cells

    Excel Read Cell
    ├── file: report.xlsx
    ├── sheet: "Sales"
    └── cell: "B5"
        │
        ▼
    Value: 45230.00

    Excel Write Cell
    ├── file: report.xlsx
    ├── sheet: "Sales"
    ├── cell: "C5"
    └── value: "Processed"

Manage Worksheets

    Get Sheet Names (file)
        │
        ▼
    ["Sales", "Inventory", "Summary"]

    New Worksheet
    ├── file: report.xlsx
    └── name: "Q4 Results"

    Copy Worksheet
    ├── source: template.xlsx
    ├── source_sheet: "Template"
    ├── target: report.xlsx
    └── target_sheet: "January"

Loop Through Rows
Process all rows in a worksheet:

    For Each Row (file: data.xlsx, sheet: "Customers")
        │
        ├── row.A ──▶ customer_id
        ├── row.B ──▶ name
        └── row.C ──▶ email
        │
        ▼
    Process each customer

Extract Tables Intelligently
For messy Excel files with multiple tables:

    AI Extract Tables
    ├── file: messy_report.xlsx
    ├── model: GPT-4
    └── Strategy determined by AI
        │
        ▼
    Array of structured tables

The AI:
- Analyzes the spreadsheet structure
- Identifies table boundaries
- Determines headers
- Extracts clean data
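
To make these four steps concrete, the sketch below applies a plain rule-based heuristic to a small in-memory grid: blank rows mark table boundaries, and the first row of each block is treated as its header. This is only an illustration of the idea, not how the AI node works internally (the source does not describe that). The function name `split_tables` and the sample data are hypothetical.

```python
from typing import Any

Row = list[Any]


def split_tables(grid: list[Row]) -> list[list[dict[str, Any]]]:
    """Split a sheet-like grid into tables separated by blank rows.

    Hypothetical helper for illustration only: the first row of each
    block is assumed to be the header row.
    """
    tables: list[list[dict[str, Any]]] = []
    block: list[Row] = []

    def flush() -> None:
        # Turn the collected block into records keyed by its header row.
        if len(block) >= 2:
            header = [str(cell) for cell in block[0]]
            tables.append([dict(zip(header, row)) for row in block[1:]])
        block.clear()

    for row in grid:
        is_blank = all(cell in (None, "") for cell in row)
        if is_blank:
            flush()  # a blank row closes the current table
        else:
            block.append(row)
    flush()  # close the last table if the grid does not end with a blank row
    return tables


# Sample grid with two tables separated by an empty row (made-up data).
grid = [
    ["region", "sales"],
    ["North", 1200],
    ["South", 950],
    [None, None],
    ["sku", "stock"],
    ["A-1", 14],
]
tables = split_tables(grid)
print(tables)
```

Real spreadsheets often break this heuristic (side-by-side tables, merged headers, title rows), which is the case the AI-based node is meant to cover.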
Microsoft 365 Excel
Work with Excel files in OneDrive/SharePoint:

    Microsoft Provider (OAuth)
        │
        ▼
    List Excel Worksheets
    ├── file_id: "abc123"
    └── site_id (optional for SharePoint)
        │
        ▼
    Read Excel Range
    ├── sheet: "Data"
    └── range: "A1:D100"
        │
        ▼
    Array of rows

CSV Processing
Stream Large Files
Process CSV files without loading everything into memory:

    Buffered CSV Reader (large_file.csv)
        │
        ▼
    For Each batch (1000 rows)
        │
        ▼
    Process batch ──▶ Insert to database

Convert to Database
Load CSV into queryable format:

    Create Database (lance_db)
        │
        ▼
    Load CSV (sales.csv)
        │
        ▼
    SQL Query: "SELECT * FROM sales WHERE amount > 1000"

DataFusion Integration
Query CSV files with SQL:

    Create DataFusion Session
        │
        ▼
    Register CSV ("sales", sales.csv)
        │
        ▼
    Register CSV ("customers", customers.csv)
        │
        ▼
    SQL Query:
    "SELECT c.name, SUM(s.amount) as total
     FROM sales s
     JOIN customers c ON s.customer_id = c.id
     GROUP BY c.name
     ORDER BY total DESC"

Image Processing
Read & Analyze

    Read Image (photo.jpg)
        │
        ├── Image Dimensions ──▶ {width: 1920, height: 1080}
        │
        └── AI Extract Document ──▶ Extracted text/content

Transform Images
| Node | Description |
|---|---|
| Resize | Scale to specific dimensions |
| Crop | Extract region |
| Rotate | Rotate by degrees |
| Flip | Horizontal or vertical flip |
| Blur | Apply blur effect |
| Brighten | Adjust brightness |
| Contrast | Adjust contrast |
| Convert | Change format (PNG, JPG, WebP) |
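
As a worked example for the Resize row, the arithmetic behind a width-capped resize (for instance a cap of 1024 pixels) can be sketched as follows. The helper `fit_width` is hypothetical and only computes target dimensions. It assumes aspect ratio is preserved and that smaller images are not upscaled, neither of which the source states explicitly.

```python
def fit_width(width: int, height: int, max_width: int) -> tuple[int, int]:
    """Return (width, height) scaled down so that width <= max_width.

    Hypothetical helper: keeps the aspect ratio and never upscales.
    """
    if width <= max_width:
        return width, height
    scale = max_width / width
    return max_width, max(1, round(height * scale))


# A 1920x1080 image capped at 1024 pixels wide.
new_size = fit_width(1920, 1080, max_width=1024)
print(new_size)  # (1024, 576)
```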
Example: Prepare images for processing

    Read Image
        │
        ▼
    Resize (max_width: 1024)
        │
        ▼
    Convert to PNG
        │
        ▼
    AI Analysis

Barcode & QR Reading

    Read Barcodes (image)
        │
        ▼
    [{
      type: "QR_CODE",
      data: "https://example.com/product/123",
      bounds: {x, y, width, height}
    }]

Draw Annotations
Add bounding boxes or annotations:

    Draw Boxes
    ├── image: document.png
    ├── boxes: [{x, y, w, h, label: "Invoice Number"}]
    └── color: red
        │
        ▼
    Annotated image

Generate QR Codes

    Write QR Code
    ├── data: "https://myapp.com/verify/abc123"
    ├── size: 256
    └── format: PNG
        │
        ▼
    QR code image

Text Extraction
HTML to Markdown
Clean up web content:

    HTML to Markdown
    ├── html: "<h1>Title</h1><p>Content...</p>"
    └── remove_tags: ["script", "style", "nav"]
        │
        ▼
    "# Title\n\nContent..."

Keyword Extraction
YAKE (Unsupervised):

    YAKE Keywords
    ├── text: document_content
    ├── language: "en"
    └── max_keywords: 10
        │
        ▼
    ["machine learning", "data processing", "automation", ...]

RAKE (Rule-based):

    RAKE Keywords
    ├── text: document_content
    └── language: "en"
        │
        ▼
    [{keyword: "artificial intelligence", score: 8.5}, ...]

AI-Powered:

    AI Keyword Extraction
    ├── text: document_content
    └── model: GPT-4
        │
        ▼
    Semantically relevant keywords

Batch Processing
Process Folder of Documents

    Quick Action Event (folder_path)
        │
        ▼
    List Paths (folder_path, pattern: "*.pdf")
        │
        ▼
    For Each file_path
        │
        ▼
    Detect file type
        │
        ├── PDF ──▶ AI Extract Document
        ├── Excel ──▶ Extract Tables
        ├── Image ──▶ OCR Extract
        └── CSV ──▶ Load to Database
        │
        ▼
    Store extracted data ──▶ Generate report

Watch Folder for New Files

    Scheduled Event (every 5 minutes)
        │
        ▼
    List Paths (/incoming, modified_after: last_run)
        │
        ▼
    For Each new_file
        │
        ▼
    Process document ──▶ Move to /processed

AI-Powered Processing
Structured Extraction
Extract specific fields from any document:

    AI Extract Document (document)
        │
        ▼
    Extract Knowledge
    ├── Schema:
    │   ├── company_name: String
    │   ├── document_type: Enum["invoice", "receipt", "contract"]
    │   ├── date: Date
    │   ├── total_amount: Number
    │   └── line_items: Array<{description, amount}>
    │
    └── Model: GPT-4
        │
        ▼
    Validated structured data

Document Classification

    AI Classification
    ├── document: extracted_text
    ├── categories: ["Invoice", "Receipt", "Contract", "Report", "Letter"]
    └── model: GPT-4
        │
        ▼
    {
      category: "Invoice",
      confidence: 0.95
    }

Summarization

    Invoke LLM
    ├── prompt: "Summarize this document in 3 bullet points: {document_text}"
    └── model: GPT-4
        │
        ▼
    • Key point 1
    • Key point 2
    • Key point 3

Template Processing
Generate documents from templates:

    Render Template
    ├── template: "Dear {name},\n\nYour order #{order_id} has shipped..."
    ├── variables:
    │   ├── name: "Alice"
    │   └── order_id: "12345"
        │
        ▼
    "Dear Alice,\n\nYour order #12345 has shipped..."

Jinja-style features:
- Variable interpolation: {variable}
- Conditionals: {% if condition %}...{% endif %}
- Loops: {% for item in items %}...{% endfor %}
- Filters: {name|upper}
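
To show what variable interpolation and filters amount to, here is a minimal sketch in plain Python. It is not the engine behind Render Template: the `render` function and the `FILTERS` table are hypothetical, and the sketch covers only the single-brace `{variable}` and `{variable|filter}` forms listed above, not conditionals or loops.

```python
import re

# Hypothetical filter table; only two filters are sketched here.
FILTERS = {"upper": str.upper, "lower": str.lower}

# Matches {name} or {name|filter}; {% ... %} blocks are left untouched.
PLACEHOLDER = re.compile(r"\{(\w+)(?:\|(\w+))?\}")


def render(template: str, variables: dict[str, object]) -> str:
    """Replace {name} and {name|filter} placeholders with variable values."""

    def substitute(match: re.Match) -> str:
        name, filter_name = match.group(1), match.group(2)
        value = str(variables[name])  # raises KeyError for unknown variables
        if filter_name:
            value = FILTERS[filter_name](value)
        return value

    return PLACEHOLDER.sub(substitute, template)


message = render(
    "Dear {name},\n\nYour order #{order_id} has shipped...",
    {"name": "Alice", "order_id": "12345"},
)
shouted = render("{name|upper}", {"name": "Alice"})
print(message)
print(shouted)
```

Failing loudly on an unknown variable is a deliberate choice in this sketch; a real template engine may instead render an empty string.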
File Operations
Basic Operations
| Node | Description |
|---|---|
| Copy | Copy file to new location |
| Delete | Remove file |
| Rename | Rename/move file |
| Exists | Check if file exists |
| File Hash | Compute MD5/SHA hash |
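
For readers who want to see what these operations do outside a flow, the sketch below mirrors each row using only the Python standard library. It is an analogy, not the nodes' implementation; the file names and the `file_hash` helper are made up for illustration.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def file_hash(path: Path, algorithm: str = "sha256") -> str:
    """Hash a file in chunks so large files are not read into memory at once."""
    digest = hashlib.new(algorithm)
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "report.txt"
    source.write_bytes(b"hello")

    # Copy: duplicate the file at a new location.
    copied = Path(shutil.copy(source, Path(tmp) / "report_copy.txt"))
    # Rename: move the copy to a new name.
    renamed = copied.rename(Path(tmp) / "report_final.txt")
    # Exists: check that the renamed file is present.
    exists_after_rename = renamed.exists()
    # File Hash: compute MD5 and SHA-256 digests.
    md5_hex = file_hash(renamed, "md5")
    sha256_hex = file_hash(renamed)
    # Delete: remove the file, then confirm it is gone.
    renamed.unlink()
    exists_after_delete = renamed.exists()

print(md5_hex, sha256_hex, exists_after_rename, exists_after_delete)
```

MD5 is fine for detecting duplicates or accidental corruption, but a SHA-2 digest is the safer default when the hash is used to verify integrity against tampering.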
Cloud Storage
Work with files in cloud storage:

    S3 / Azure / GCS Provider
        │
        ▼
    List Files (bucket/container)
        │
        ▼
    Download File
        │
        ▼
    Process locally
        │
        ▼
    Upload results

Signed URLs
Generate temporary access URLs:

    Sign URL
    ├── path: "reports/quarterly.pdf"
    ├── expiry: 3600 (seconds)
    └── provider: S3
        │
        ▼
    "https://bucket.s3.amazonaws.com/reports/quarterly.pdf?signature=..."

Example Pipelines
Invoice Processing

    Watch Folder (/invoices)
        │
        ▼
    For Each new PDF
        │
        ├──▶ AI Extract Document
        │
        ├──▶ Extract Knowledge (Invoice Schema)
        │
        ├──▶ Validate required fields
        │         │
        │         ├── Valid ──▶ Insert to Database
        │         │                  │
        │         │                  ▼
        │         │             Create Approval Task
        │         │
        │         └── Invalid ──▶ Move to /review
        │
        └──▶ Move to /processed

Document Search System

    Ingest Pipeline:
    ├── List all documents
    ├── For Each document
    │   ├── Extract text (AI Extract Document)
    │   ├── Chunk into sections
    │   ├── Generate embeddings
    │   └── Insert to Vector DB
    │
    Query Pipeline:
    ├── User search query
    ├── Embed query
    ├── Vector search (top 10)
    ├── Return matching documents with snippets

Report Generation

    Scheduled Event (monthly)
        │
        ▼
    Query Database (monthly stats)
        │
        ▼
    Generate Charts (Nivo)
        │
        ▼
    Render Template (report_template.md)
        │
        ▼
    Convert to PDF
        │
        ▼
    Email Report ──▶ Archive

Best Practices
1. Handle Encoding
Always specify encoding for text files:

    Read to String (file, encoding: "utf-8")

2. Validate Before Processing
Check file type and size before heavy processing:

    File Exists → Get File Size → Validate → Process

3. Use Appropriate Extraction
| Document Type | Best Approach |
|---|---|
| Clean PDF | Direct text extraction |
| Scanned PDF | AI vision OCR |
| Structured Excel | Cell/range reading |
| Messy Excel | AI table extraction |
| Mixed content | AI Extract Document |
4. Batch Wisely
For large volumes, process in batches to manage memory:

    For Each batch of 100 files
        │
        ▼
    Process batch ──▶ Save results ──▶ Next batch

5. Archive Originals
Keep original documents before processing:

    Copy to /archive ──▶ Process ──▶ Store results

Next Steps
- Data Loading – Store extracted data
- DataFusion – Query processed data
- GenAI – Advanced AI extraction
- Building Internal Tools – Create document processing UIs