Data Loading & Storage

Every data science project starts with data. Flow-Like provides comprehensive tools for loading data from various sources, storing it efficiently, and managing your data assets.

CSV is one of the most common data formats. Flow-Like offers two approaches to reading it:

For smaller files, read the entire contents:

Read to String
├── Path: (FlowPath to CSV)
└── Content ──▶ (string with CSV data)

For large files, stream data in chunks to avoid memory issues:

Buffered CSV Reader
├── Path: (FlowPath to CSV)
├── Chunk Size: 10000 (rows per batch)
├── Delimiter: ","
├── On Chunk ──▶ (triggers for each batch)
├── Chunk ──▶ (current batch data)
└── Done ──▶ (file fully processed)

When to use each:

Approach          File Size   Memory Usage   Use Case
Read to String    < 50 MB     High           Quick analysis, small datasets
Buffered Reader   Any size    Controlled     ETL pipelines, large datasets
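
For example, a streaming pipeline that loads a large CSV into a database chunk by chunk might be wired like this (the Transform step is a placeholder for whatever per-chunk processing you need; Batch Insert is covered in the database section below):

Buffered CSV Reader ──▶ On Chunk ──▶ Transform ──▶ Batch Insert
└── Done ──▶ (runs once the whole file is processed)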

For Excel workbooks, Flow-Like provides a full set of nodes:

Node              Purpose
Get Sheet Names   List all sheets in a workbook
Get Row           Read a specific row
Loop Rows         Iterate through all rows
Read Cell         Read a specific cell value

The Try Extract Tables node automatically detects tables in Excel:

Try Extract Tables
├── Path: (FlowPath to Excel)
├── Min Table Cells: 4
├── Max Header Rows: 3
├── Drop Totals: true
├── Group Similar Headers: true
└── Tables ──▶ (array of detected tables)

This is powerful for messy spreadsheets with:

  • Multiple tables per sheet
  • Headers spanning multiple rows
  • Merged cells
  • Total/summary rows

A typical multi-sheet extraction flow:

Get Sheet Names ──▶ For Each Sheet ──▶ Try Extract Tables ──▶ Process
│                   │                  │
│                   │                  └── tables array
│                   └── sheet name
└── ["Sheet1", "Data", "Summary"]

For JSON data, you can validate structured input against a schema:

Parse with Schema
├── JSON: (JSON string)
├── Schema: (JSON Schema definition)
├── Valid ──▶ (parsing succeeded)
├── Result ──▶ (parsed object)
└── Invalid ──▶ (validation failed)
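
A minimal example of a JSON Schema you might supply on the Schema pin (the field names here are illustrative):

{
  "type": "object",
  "properties": {
    "name": { "type": "string" },
    "age": { "type": "integer", "minimum": 0 }
  },
  "required": ["name", "age"]
}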

The Repair Parse node fixes common JSON issues:

Repair Parse
├── Input: "{name: 'John', age: 30}" (invalid JSON)
└── Result ──▶ {"name": "John", "age": 30} (fixed)

Handles:

  • Unquoted keys
  • Single quotes
  • Trailing commas
  • Missing brackets

Parquet is ideal for large analytical datasets:

Mount Parquet to DataFusion
├── Path: (FlowPath to .parquet)
├── Table Name: "analytics"
└── Session ──▶ (DataFusion session with table)

Then query with SQL:

SELECT * FROM analytics WHERE date > '2025-01-01'
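
DataFusion speaks standard SQL, so aggregations and grouping work as well. For example (assuming the table has date and amount columns):

SELECT date, SUM(amount) AS revenue
FROM analytics
GROUP BY date
ORDER BY date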

Every Flow-Like app has dedicated storage for files and databases. To upload files:

  1. Go to your app’s Storage section
  2. Click Upload or drag-and-drop files
  3. Files are now accessible via FlowPath

FlowPath is Flow-Like’s unified path system:

Path Type     Example                        Description
App Storage   storage://data/sales.csv       Files in your app’s storage
Temp          temp://processing/output.csv   Temporary files (cleared on restart)
Absolute      /Users/me/data.csv             Local filesystem (desktop only)

Build paths dynamically with the Make FlowPath node:

Make FlowPath
├── Scheme: "storage"
├── Path: "data/sales.csv"
└── Path ──▶ (FlowPath object)

Flow-Like includes LanceDB, a vector database for storing structured data:

Open Database
├── Name: "my_dataset"
└── Database ──▶ (connection reference)

Single Record:

Insert
├── Database: (connection)
├── Data: {"name": "John", "age": 30, "city": "NYC"}
└── End

Batch Insert:

Batch Insert
├── Database: (connection)
├── Values: [array of records]
└── End
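
The Values input is an array of records with consistent fields, for example (the fields are illustrative):

[
  {"name": "John", "age": 30, "city": "NYC"},
  {"name": "Ada", "age": 36, "city": "London"}
]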

From CSV:

Batch Insert CSV
├── Database: (connection)
├── CSV: (CSVTable data)
└── End

Query the database with these nodes:

Node            Purpose             Use Case
Filter          SQL WHERE clause    Exact matches, ranges
List            Paginated listing   Browse all data
Vector Search   Similarity search   Find similar items
FTS Search      Full-text search    Keyword matching
Hybrid Search   Vector + FTS        Best of both

For example, Filter Database:

Filter Database
├── Database: (connection)
├── SQL Filter: "age > 25 AND city = 'NYC'"
├── Limit: 100
└── Results ──▶ (matching records)
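
The SQL Filter accepts standard SQL predicate syntax, so richer conditions are possible too; for example (column names are illustrative):

city IN ('NYC', 'LA') AND age BETWEEN 25 AND 40
signup_date >= '2025-01-01' OR vip = true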

Maintenance nodes keep the database healthy:

Node         Purpose
Index        Create indexes for faster queries
Optimize     Compact and optimize storage
Purge        Remove deleted records permanently
Get Schema   Inspect table structure
Count        Get record count

Connect to cloud object stores:

S3 Store
├── Bucket: "my-data-bucket"
├── Region: "us-east-1"
├── Access Key: (secret)
├── Secret Key: (secret)
└── Store ──▶ (object store connection)
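
A common pattern, sketched here conceptually (the exact wiring between the store connection and the file nodes depends on your flow), is to list the objects in the bucket and process them one at a time:

S3 Store ──▶ List Paths ──▶ For Each ──▶ Read to String ──▶ Process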

Supported Providers:

  • AWS S3
  • Azure Blob Storage
  • Google Cloud Storage
  • S3-compatible (MinIO, etc.)

Flow-Like connects to popular services:

Service            Capabilities
GitHub             Clone repos, issues, PRs, releases
Notion             Pages, databases, search
Confluence         Pages, spaces, comments
Google Workspace   Sheets, Drive, Calendar
Microsoft 365      Excel, OneDrive, SharePoint
Databricks         Query Databricks tables

Connect directly to databases for federated queries:

Register PostgreSQL
├── Host: "db.example.com"
├── Port: 5432
├── Database: "analytics"
├── User: (secret)
├── Password: (secret)
├── Table: "transactions"
├── Alias: "txns"
└── Session ──▶ (DataFusion session)

Now query with SQL:

SELECT * FROM txns WHERE amount > 1000
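
Because the table lives in a DataFusion session, you can also join it against other sources registered in the same session, such as a mounted Parquet table (table and column names below are illustrative, and assume both tables share a user_id column):

SELECT t.user_id, SUM(t.amount) AS total_spend
FROM txns t
JOIN analytics a ON a.user_id = t.user_id
GROUP BY t.user_id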

Flow-Like also includes general file-management nodes:

Node         Purpose
Copy         Duplicate a file
Rename       Change file name
Delete       Remove a file
Exists       Check if file exists
List Paths   List directory contents
Sign URL     Generate temporary download URL

To write data back to storage:

Write String:

Write String
├── Path: (FlowPath)
├── Content: "CSV data..."
└── End

Write Bytes:

Write Bytes
├── Path: (FlowPath)
├── Bytes: (binary data)
└── End

For streaming reads, balance memory vs. performance:

  • Small chunks (e.g., 1,000 rows): low memory use, slower overall
  • Large chunks (e.g., 50,000 rows): faster, but more memory per batch
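
As a rough back-of-the-envelope estimate, memory per chunk scales with rows × columns × average cell size. Assuming around 20 columns and 20 bytes per cell, a 1,000-row chunk holds roughly 1,000 × 20 × 20 ≈ 0.4 MB, while a 50,000-row chunk holds around 20 MB.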

Create indexes on columns you filter frequently:

Index Database
├── Database: (connection)
└── Columns: ["user_id", "date"]

Convert large CSVs to Parquet for:

  • Faster queries (columnar)
  • Better compression
  • Type preservation

Keep your app storage organized with a consistent folder layout, for example:

storage://
├── raw/         # Original files
├── processed/   # Cleaned data
├── models/      # Trained ML models
└── exports/     # Output files

Always check for file existence before reading:

Exists ──▶ Branch ──▶ Read File
           └── (File not found) ──▶ Error handling

If a file cannot be found:

  • Check the FlowPath scheme (storage://, temp://, etc.)
  • Verify the file was uploaded to app storage
  • Check for typos in the path

If large files run out of memory:

  • Use Buffered CSV Reader with smaller chunk sizes
  • Process data incrementally instead of loading all at once
  • Consider converting to Parquet format

If CSV data parses incorrectly:

  • Check delimiter settings (comma vs. semicolon)
  • Verify encoding (UTF-8 is recommended)
  • Look for unquoted special characters in data

With your data loaded, continue to: