# Machine Learning
Flow-Like includes a complete machine learning toolkit built on linfa (a Rust machine learning library) and ONNX Runtime for neural network inference. Train models visually without writing code.

## Available Algorithms

### Classification

| Algorithm | Node | Best For |
|---|---|---|
| Decision Tree | Fit Decision Tree | Interpretable rules, multi-class |
| Naive Bayes | Fit Naive Bayes | Fast baseline, Gaussian features |
| SVM | Fit SVM Multi-Class | High accuracy, complex boundaries |

### Regression

| Algorithm | Node | Best For |
|---|---|---|
| Linear Regression | Fit Linear Regression | Continuous predictions, feature importance |

### Clustering

| Algorithm | Node | Best For |
|---|---|---|
| K-Means | Fit KMeans | Known cluster count, spherical clusters |
| DBSCAN | Fit DBSCAN | Unknown cluster count, outlier detection |

### Dimensionality Reduction

| Algorithm | Node | Best For |
|---|---|---|
| PCA | Fit PCA | Feature reduction, visualization prep |

### Deep Learning (ONNX)

| Model Type | Node | Best For |
|---|---|---|
| Image Classification | ONNX TIMM | Classify images |
| Object Detection | ONNX YOLO/D-FINE | Detect objects in images |
| Teachable Machine | Teachable Machine | Quick prototyping |

## Data Preparation

### Input Format

ML nodes expect data in a LanceDB database with:

- A `records` column: 2D float array (feature matrix)
- A `targets` column: labels (classification) or values (regression)
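
For reference, this is roughly what that layout looks like when staged by hand with numpy and the LanceDB Python client. It is an illustrative sketch only: the path, table name, and values below are made up, and inside Flow-Like the database nodes handle this for you.

```python
import lancedb
import numpy as np

# Feature matrix: one row per sample, one float per feature
records = np.array([
    [5.1, 3.5, 1.4],
    [4.9, 3.0, 1.4],
    [6.3, 3.3, 6.0],
], dtype=np.float32)
targets = ["setosa", "setosa", "virginica"]  # class labels for classification

db = lancedb.connect("./ml_data")            # illustrative local path
db.create_table("training", data=[
    {"records": row.tolist(), "targets": label}
    for row, label in zip(records, targets)
])
```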

### Preparing Your Data

```
1. Load Data (CSV, SQL, etc.)
        │
        ▼
2. Insert into Database
        │
        ▼
3. Format as records/targets
        │
        ▼
4. Split (train/test)
        │
        ▼
5. Train Model
```

### Dataset Splitting

Random Split:

```
Split Dataset
 │
 ├── Database: (input data)
 ├── Split Ratio: 0.8 (80% train, 20% test)
 │
 ├── Train ──▶ (training database)
 └── Test ──▶ (test database)
```

Stratified Split (preserves class distribution):

```
Stratified Split
 │
 ├── Database: (input data)
 ├── Target Column: "label"
 ├── Split Ratio: 0.8
 │
 ├── Train ──▶ (balanced training set)
 └── Test ──▶ (balanced test set)
```
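
If you are curious what the split does under the hood, an equivalent stratified 80/20 split in scikit-learn looks like this. It is a reference sketch only, not Flow-Like's implementation; the iris dataset stands in for your records/targets columns.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)   # stand-ins for the records/targets columns
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    train_size=0.8,   # Split Ratio: 0.8
    stratify=y,       # preserve the class distribution in both halves
    random_state=42,
)
```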

### Other Data Operations

| Node | Purpose |
|---|---|
| Shuffle Dataset | Randomize row order |
| Sample Dataset | Take a random subset |

## Classification Models

### Decision Tree

Decision trees create interpretable if-then rules:

```
Fit Decision Tree
 │
 ├── Database: (training data)
 ├── Max Depth: 10 (0 = unlimited)
 ├── Min Samples Split: 2
 │
 └── Model ──▶ (trained decision tree)
```

When to use:
- You need to explain predictions
- Data has clear decision boundaries
- Multi-class classification
Parameters:
| Parameter | Effect | Recommendation |
|---|---|---|
| Max Depth | Tree complexity | Start with 5-10, increase if underfitting |
| Min Samples Split | Minimum samples to split | Higher values prevent overfitting |
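
Conceptually, the node fits the same kind of model as this scikit-learn sketch (shown for reference only; the hyperparameter names happen to line up):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=0)

# Max Depth and Min Samples Split correspond to max_depth / min_samples_split
tree = DecisionTreeClassifier(max_depth=10, min_samples_split=2)
tree.fit(X_train, y_train)
print(tree.score(X_test, y_test))   # accuracy on held-out data
```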

### Naive Bayes

Fast Gaussian classifier:

```
Fit Naive Bayes
 │
 ├── Database: (training data)
 │
 └── Model ──▶ (trained Naive Bayes)
```

When to use:
- Quick baseline model
- Features are roughly Gaussian
- Fast inference needed
Pros/Cons:
| Pros | Cons |
|---|---|
| Very fast training | Assumes feature independence |
| Works with small datasets | Less accurate than trees/SVM |
| Handles multi-class naturally | Sensitive to feature scaling |
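
For comparison, the equivalent Gaussian Naive Bayes baseline in scikit-learn is nearly a one-liner (a reference sketch, not the toolkit's internals):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=0)

nb = GaussianNB().fit(X_train, y_train)   # no hyperparameters needed for a quick baseline
print(nb.score(X_test, y_test))
```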

### SVM (Support Vector Machine)

High-accuracy classifier with RBF kernel:

```
Fit SVM Multi-Class
 │
 ├── Database: (training data)
 │
 └── Model ──▶ (trained SVM ensemble)
```

When to use:
- Maximum accuracy needed
- Smaller datasets (< 10,000 samples)
- Complex decision boundaries
Notes:
- Uses One-vs-All strategy for multi-class
- Gaussian (RBF) kernel by default
- Slower training than trees/Naive Bayes
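
The One-vs-All ensemble with a Gaussian kernel corresponds roughly to this scikit-learn sketch (illustrative only, not Flow-Like's actual implementation):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=0)

# One binary RBF-kernel SVM per class, combined one-vs-rest
svm = OneVsRestClassifier(SVC(kernel="rbf")).fit(X_train, y_train)
print(svm.score(X_test, y_test))
```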

## Regression Models

### Linear Regression

Predict continuous values:

```
Fit Linear Regression
 │
 ├── Database: (training data with numeric targets)
 │
 └── Model ──▶ (trained linear model)
```

When to use:
- Predicting continuous values
- Understanding feature importance
- Linear relationship expected
Getting Coefficients:

```
Get Linear Coefficients
 │
 ├── Model: (trained linear regression)
 │
 └── Info ──▶ {
       coefficients: [0.5, -0.3, 0.8],
       intercept: 2.1,
       n_features: 3
     }
```
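
A reference sketch of the same fit and coefficient extraction in scikit-learn (the data and numbers below are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])   # records
y = np.array([5.1, 4.0, 11.2, 10.1])                             # numeric targets

reg = LinearRegression().fit(X, y)
print(reg.coef_)        # one coefficient per feature
print(reg.intercept_)   # intercept, as reported by Get Linear Coefficients
```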

## Clustering Models

### K-Means

Partition data into k clusters:

```
Fit KMeans
 │
 ├── Database: (data with records column)
 ├── Clusters: 5 (number of clusters)
 │
 └── Model ──▶ (trained KMeans)
```

When to use:
- You know the number of clusters
- Clusters are roughly spherical
- Customer segmentation, grouping
Getting Centroids:

```
Get KMeans Centroids
 │
 ├── Model: (trained KMeans)
 │
 └── Info ──▶ {
       k: 5,
       dimensions: 3,
       centroids: [[...], [...], ...]
     }
```
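
For reference, the equivalent fit and centroid lookup in scikit-learn, using synthetic data (illustrative only):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=5, n_features=3, random_state=0)

km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_.shape)   # (5, 3): k centroids, one value per dimension
print(km.labels_[:10])             # cluster assignment for the first samples
```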

### DBSCAN

Density-based clustering:

```
Fit DBSCAN
 │
 ├── Database: (data with records column)
 ├── Epsilon: 0.5 (max distance between points)
 ├── Min Points: 5 (points to form dense region)
 │
 ├── End ──▶ (clustering complete)
 ├── N Clusters ──▶ (number found)
 └── N Noise ──▶ (outliers found)
```

When to use:
- Unknown number of clusters
- Need to detect outliers/anomalies
- Non-spherical cluster shapes
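
A scikit-learn sketch of the same idea, including how the cluster and noise counts fall out of the labels (reference only; the moons dataset is just a stand-in):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

db = DBSCAN(eps=0.5, min_samples=5).fit(X)   # Epsilon / Min Points
labels = db.labels_                          # -1 marks noise/outlier points

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print(n_clusters, n_noise)
```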

## Dimensionality Reduction

### PCA (Principal Component Analysis)

Reduce feature dimensions:

```
Fit PCA
 │
 ├── Database: (high-dimensional data)
 ├── N Components: 2 (target dimensions)
 ├── Output Column: "reduced"
 │
 ├── End ──▶ (reduction complete)
 └── Vectors ──▶ (reduced vectors)
```

When to use:
- Too many features (high-dimensional data)
- Preparing for visualization (reduce to 2-3D)
- Removing noise/redundant features
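
For reference, the same reduction in scikit-learn (illustrative sketch, iris as stand-in data):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)        # 4 features per sample
pca = PCA(n_components=2)                # N Components: 2
reduced = pca.fit_transform(X)           # shape: (n_samples, 2)
print(pca.explained_variance_ratio_)     # variance retained per component
```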

## Making Predictions

The Predict node works with any trained model:

### Predict on Database

```
Predict
 │
 ├── Model: (any trained ML model)
 ├── Mode: "Database"
 ├── Database: (data to predict)
 ├── Input Column: "records"
 ├── Output Column: "predictions"
 ├── Batch Size: 5000
 │
 ├── End ──▶ (predictions complete)
 └── Database ──▶ (with predictions column)
```

### Predict on Vector

For single predictions:

```
Predict
 │
 ├── Model: (trained model)
 ├── Mode: "Vector"
 ├── Vector: [1.5, 2.3, 0.8, ...] (features)
 │
 └── Prediction ──▶ "class_a" (or numeric value)
```
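
Conceptually, the two modes differ only in whether you feed a whole table or a single feature vector. A scikit-learn sketch for comparison (reference only; the model and values are illustrative):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(max_depth=5).fit(X, y)

batch_predictions = model.predict(X)                      # "Database" mode: predict every row
single = model.predict(np.array([[1.5, 2.3, 0.8, 1.1]]))  # "Vector" mode: one feature vector
print(single[0])
```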

## Model Evaluation

### Classification Metrics

Accuracy:

```
Evaluate Accuracy
 │
 ├── Database: (with predictions & targets)
 ├── Prediction Column: "predictions"
 ├── Target Column: "targets"
 │
 └── Result ──▶ {
       accuracy: 0.92,
       correct: 920,
       total: 1000
     }
```

Confusion Matrix:

```
Evaluate Confusion Matrix
 │
 ├── Database: (with predictions & targets)
 ├── Prediction Column: "predictions"
 ├── Target Column: "targets"
 │
 └── Result ──▶ {
       matrix: [[45, 5], [3, 47]],
       precision: [0.94, 0.90],
       recall: [0.90, 0.94],
       f1_score: [0.92, 0.92]
     }
```
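
The same metrics computed with scikit-learn, for readers who want to check the math (reference sketch only; the model and data are stand-ins):

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, confusion_matrix, precision_recall_fscore_support
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, stratify=y, random_state=0)
preds = DecisionTreeClassifier(max_depth=5).fit(X_train, y_train).predict(X_test)

print(accuracy_score(y_test, preds))                 # Evaluate Accuracy
print(confusion_matrix(y_test, preds))               # Evaluate Confusion Matrix
precision, recall, f1, _ = precision_recall_fscore_support(y_test, preds)
print(precision, recall, f1)                         # per-class scores
```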

### Regression Metrics

```
Evaluate Regression
 │
 ├── Database: (with predictions & targets)
 ├── Prediction Column: "predictions"
 ├── Target Column: "targets"
 │
 └── Result ──▶ {
       mse: 0.05,
       rmse: 0.22,
       mae: 0.18,
       r_squared: 0.89
     }
```

Metric Guide:

| Metric | Description | Good Value |
|---|---|---|
| MSE | Mean Squared Error | Lower is better |
| RMSE | Root MSE (same units as target) | Lower is better |
| MAE | Mean Absolute Error | Lower is better |
| R² | Variance explained | Closer to 1.0 |
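
The four metrics are straightforward to compute yourself; a short reference sketch with made-up values:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 2.5, 7.0])   # targets column
y_pred = np.array([2.8, 5.3, 2.4, 6.6])   # predictions column

mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)                        # back in the target's units
mae = mean_absolute_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
print(mse, rmse, mae, r2)
```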

## Saving and Loading Models

### Save Model

```
Save ML Model
 │
 ├── Model: (trained model)
 ├── Path: (FlowPath for output)
 │
 └── End
```

Formats:
- JSON – Human-readable, portable
- Binary – Faster, smaller (Fory format)

### Load Model

```
Load ML Model
 │
 ├── Path: (FlowPath to saved model)
 │
 └── Model ──▶ (loaded model ready for predictions)
```

## ONNX Models (Deep Learning)

For pre-trained neural networks:

### Loading ONNX Models

```
Load ONNX
 │
 ├── Path: (FlowPath to .onnx file)
 │
 └── Session ──▶ (ONNX inference session)
```

### Image Classification (TIMM)

Use models exported from PyTorch Image Models:

```
ONNX Classification
 │
 ├── Session: (ONNX session)
 ├── Image: (image data)
 ├── Top K: 5
 │
 └── Results ──▶ [
       {class_idx: 281, score: 0.92},
       {class_idx: 282, score: 0.05},
       ...
     ]
```
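
Outside of Flow-Like, the same inference can be run with the onnxruntime Python package. This is a minimal sketch: the model path is illustrative, and a real TIMM export needs model-specific resizing and normalization in place of the random tensor.

```python
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("model.onnx")       # illustrative path
input_name = session.get_inputs()[0].name

# A random tensor stands in for a preprocessed 224x224 RGB image (NCHW)
image = np.random.rand(1, 3, 224, 224).astype(np.float32)

scores = session.run(None, {input_name: image})[0][0]   # raw logits unless the export applies softmax
top_k = np.argsort(scores)[::-1][:5]                    # Top K: 5
print([(int(i), float(scores[i])) for i in top_k])      # (class_idx, score) pairs
```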

### Object Detection (YOLO/D-FINE)

Detect objects in images:

```
ONNX Detection
 │
 ├── Session: (ONNX session)
 ├── Image: (image data)
 ├── Confidence: 0.5
 ├── NMS Threshold: 0.4
 │
 └── Detections ──▶ [
       {class_idx: 0, score: 0.95, x1: 10, y1: 20, x2: 100, y2: 150},
       ...
     ]
```

### Teachable Machine

For Google Teachable Machine models:

```
Teachable Machine
 │
 ├── Path: (FlowPath to .tflite)
 ├── Labels: (optional labels file)
 ├── Image: (image data)
 │
 └── Results ──▶ [{label: "cat", score: 0.95}, ...]
```

## Model Selection Guide

| Use Case | Recommended Model |
|---|---|
| Quick classification baseline | Naive Bayes |
| Need to explain predictions | Decision Tree |
| Maximum accuracy (small data) | SVM |
| Predict continuous values | Linear Regression |
| Group data (known K) | K-Means |
| Find outliers & groups | DBSCAN |
| Reduce dimensions | PCA |
| Classify images | ONNX (TIMM) |
| Detect objects | ONNX (YOLO) |
| Imbalanced classes | Use Stratified Split first |

## Complete Example: Customer Churn Prediction

```
Load CSV (customer data)
  │
  ▼
Insert to Database
  │
  ▼
Stratified Split (80/20)
  │
  ├──▶ Train Set ──▶ Fit Decision Tree ──▶ Model ──┐
  │                                                │
  └──▶ Test Set ───────────────────────────────────┴──▶ Predict
                                                          │
                                                          ▼
                                                  Confusion Matrix
                                                          │
                                                          ▼
                                                  Save Model (if good)
```
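
As a point of comparison, the same pipeline on a synthetic stand-in dataset takes only a few lines of scikit-learn. The features and labels below are fabricated for illustration; in Flow-Like every step is done through the nodes shown above.

```python
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the customer CSV: three numeric features, churned yes/no
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = (X[:, 0] + 0.5 * X[:, 2] + rng.normal(scale=0.5, size=1000) > 0).astype(int)

# Stratified 80/20 split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, stratify=y, random_state=0
)

# Fit Decision Tree, predict on the held-out set, evaluate
model = DecisionTreeClassifier(max_depth=5).fit(X_train, y_train)
print(confusion_matrix(y_test, model.predict(X_test)))
```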

## Best Practices

### 1. Always Split Your Data

Never evaluate on training data; it gives overly optimistic results.

### 2. Start Simple

Begin with Naive Bayes or Decision Trees, then try more complex models.

### 3. Use Stratified Splitting for Classification

Especially important when classes are imbalanced.

### 4. Check Feature Scaling

Some algorithms (SVM, K-Means) are sensitive to feature scales. Consider normalizing.
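
If you preprocess outside of Flow-Like, standardization typically looks like this (scikit-learn shown as a reference; fit the scaler on training data only):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=0)

scaler = StandardScaler().fit(X_train)   # learn mean/std from the training split only
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)        # reuse the same statistics for the test split
```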

### 5. Evaluate Multiple Metrics

Accuracy alone can be misleading. Check precision, recall, and F1.

### 6. Save Good Models

Don't retrain every time; save and load trained models.

## Troubleshooting

### "Model performs poorly"

- Check for data quality issues
- Try a different algorithm
- Increase training data
- Check for class imbalance

### "Training is slow"

- Reduce dataset size with sampling
- Use smaller batch sizes
- Try simpler algorithms (Naive Bayes)

### "Memory errors"

- Set `MAX_RECORDS` limit
- Process in batches
- Use sampling for very large datasets

## Next Steps

With trained models:
- Data Visualization – Visualize predictions and metrics
- AI-Powered Analysis – Combine ML with GenAI
- Data Loading – Work with more data sources