epic: Jan's path to cortex.cpp? #3690

Open · 3 tasks
dan-homebrew opened this issue Sep 17, 2024 · 3 comments

dan-homebrew commented Sep 17, 2024

Goal

  • Jan should be able to seamlessly move from Nitro to cortex.cpp
  • What is the scope of change?
    • Different inference extensions? (e.g. nitro-extension and cortex-extension?)
    • Data structures (old legacy folders vs. new?)
    • Separation of concerns (e.g. Jan used to be in charge of model downloads, now calls cortex.cpp instead?)
  • What is our strategy?
    • Parallel: support both legacy and new
    • Migration: move from old Nitro to new cortex.cpp?

Tasklist

  • Clearly articulate the architectural change that needs to happen
  • Clearly articulate the scope of changes we need to account for
  • Figure out our migration strategy
dan-homebrew changed the title from "epic: Jan migration from Nitro to cortex.cpp" to "epic: Jan to start using cortex.cpp in addition to Nitro" on Sep 17, 2024
dan-homebrew changed the title from "epic: Jan to start using cortex.cpp in addition to Nitro" to "epic: Jan's path to cortex.cpp?" on Sep 17, 2024
imtuyethan added the "P1: important Important feature / fix" label on Sep 18, 2024

louis-jan commented Sep 19, 2024

Scope of changes

  • Nitro Inference Extension
  • Model Extension
  • Monitoring Extension

Nitro inference extension

Current implementation

  • Register Models (pre-populate model.json files)
    Any extension that registers models on load will pre-populate a model.json under /models/[model-id]/model.json
sequenceDiagram
    participant ModelExtension
    participant BaseExtension
    participant FileSystem

    ModelExtension->>BaseExtension: Register Models
    BaseExtension->>BaseExtension: Pre-populate Data
    BaseExtension->>FileSystem: Write to /models
  • Load Model:
    • Set additional .dll/.so PATH (for engine loading)
    • Hardware Information (to decide engine binary)
    • Run nitro server
    • Parse prompt template
    • Load a GGUF model with its file path and model settings (passed from App)
sequenceDiagram
    participant App
    participant NitroInferenceExtension
    participant NitroServer

    App->>NitroInferenceExtension: loadModel
    NitroInferenceExtension->>NitroInferenceExtension: killProcess
    NitroInferenceExtension->>NitroInferenceExtension: fetch hardware information
    NitroInferenceExtension->>child_process: spawn Nitro process
    NitroInferenceExtension->>NitroServer: wait for server healthy
    NitroInferenceExtension->>NitroInferenceExtension: parsePromptTemplate
    NitroInferenceExtension->>NitroServer: send loadModel request
    NitroInferenceExtension->>NitroServer: wait for model loaded
  • Inference (inheritance - OAIEngine.ts)
    Any extension inheriting from the base OAI Engine class will forward requests to its respective inference endpoint.
sequenceDiagram
    participant App
    participant NitroInferenceExtension
    participant NitroServer

    App->>NitroInferenceExtension: inference
    NitroInferenceExtension->>NitroInferenceExtension: transform payload
    NitroInferenceExtension->>NitroServer: chat/completions
     
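
For reference, a rough TypeScript condensation of what the load path above amounts to today. Helper names, the port, and the endpoint paths are illustrative, not the actual extension code:

```typescript
import { spawn, type ChildProcess } from 'node:child_process'

// Hypothetical condensation of the current loadModel flow; names, port and
// endpoint paths are illustrative, not the extension's real API.
const NITRO_URL = 'http://127.0.0.1:3928'

async function waitForHealthy(retries = 20): Promise<void> {
  for (let i = 0; i < retries; i++) {
    try {
      if ((await fetch(`${NITRO_URL}/healthz`)).ok) return
    } catch {
      /* server not up yet */
    }
    await new Promise((r) => setTimeout(r, 500))
  }
  throw new Error('nitro did not become healthy')
}

async function loadModel(
  binaryPath: string, // chosen per hardware info (AVX level, CUDA version, ...)
  modelPath: string,
  settings: Record<string, unknown> // includes the parsed prompt template
): Promise<ChildProcess> {
  const nitro = spawn(binaryPath, ['--port', '3928']) // one server per model load
  await waitForHealthy()
  await fetch(`${NITRO_URL}/inferences/llamacpp/loadmodel`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ llama_model_path: modelPath, ...settings }),
  })
  return nitro // killed again before the next model load and on app exit
}
```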

Possible Changes

| Current ❌ | Upcoming ✅ |
| --- | --- |
| Run Nitro server on model load | Run cortex.cpp daemon service on start |
| Kill nitro process on pre-model-load and pre-app-exit | Keep cortex.cpp alive as a daemon process; stop on exit |
| Heavy hardware detection & prompt processing | Just send a request |
| So many requests (check port, check health, model load status) | One request to do the whole thing |
| Mixing of model management and inference - multiple responsibilities | Single responsibility |
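
A sketch of what the same operation could look like against an always-running cortex.cpp daemon. The port, the model-start route, and the payload are assumptions, not a confirmed cortex.cpp API:

```typescript
// Sketch only: assumes cortex.cpp is already running as a daemon and exposes a
// model-start endpoint. Exact route, port and payload are assumptions.
const CORTEX_URL = 'http://127.0.0.1:39281'

async function loadModel(modelId: string): Promise<void> {
  const res = await fetch(`${CORTEX_URL}/models/start`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    // cortex.cpp owns hardware detection, engine selection and prompt templates,
    // so the client only identifies the model.
    body: JSON.stringify({ model: modelId }),
  })
  if (!res.ok) throw new Error(`model load failed: ${res.status}`)
}
```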

Model extension

Current implementation

  • Download Model (ModelFile as payload)
  • Delete Model (ModelFile as payload)
  • Get Models (Scan through models folder and return ModelFile[])
  • Import Model (Generate ModelFile and download)
  • Fetch HF Repo Data (for HF model import selection)

App retrieves pre-populated models:

sequenceDiagram

App ->> ModelExtension: get available models
ModelExtension ->> FS: read /models
FS --> ModelExtension : ModelFile

App downloads a model:

sequenceDiagram

App ->> ModelExtension: downloads
ModelExtension ->> Networking : request
Networking ->> FileSystem : filestream
Networking --> ModelExtension : progress


App imports a model

sequenceDiagram

App ->> ModelExtension: downloads
ModelExtension ->> Networking : request
ModelExtension ->> model.json :generate
Networking ->> FileSystem : filestream
Networking --> ModelExtension : progress

App deletes a model

graph LR

App --> |remove| Model_Extension
Model_Extension --> |FS unlink| /models/model/__files__

Possible Changes

| Current ❌ | Upcoming ✅ |
| --- | --- |
| Implementation - depends on FS | Abstraction - API forwarding |
| List available models: scan through the model folder | GET /models |
| Delete: unlink FS | DELETE /models |
| Download: download | POST /models/pull (& progress) |
| Broken model import - uses a default model.json | cortex.cpp handles the model metadata |
| Model prediction depends on model size & available RAM/VRAM only | cortex.cpp predicts based on hardware and model.yaml |
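
A sketch of the model extension as a thin forwarding layer, using the endpoints assumed later in this comment (GET /models, POST /models/pull, DELETE /models). The port and response shapes are guesses for illustration:

```typescript
// Sketch of API forwarding; endpoint paths follow the assumptions in this issue,
// and the request/response shapes are illustrative only.
const CORTEX_URL = 'http://127.0.0.1:39281'

async function listModels(): Promise<unknown[]> {
  const res = await fetch(`${CORTEX_URL}/models`)
  const body = await res.json()
  return body.data ?? body // no folder scan on the Jan side anymore
}

async function pullModel(modelId: string): Promise<void> {
  await fetch(`${CORTEX_URL}/models/pull`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ model: modelId }), // progress assumed to arrive via polling or events
  })
}

async function deleteModel(modelId: string): Promise<void> {
  await fetch(`${CORTEX_URL}/models/${encodeURIComponent(modelId)}`, { method: 'DELETE' })
}
```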

System Monitoring extension

Current implementation

  • Get GPU Settings
  • Get System Information

App gets resource information

graph LR

App --> |getResourcesInfo| Monitoring_Extension
Monitoring_Extension --> |fetch| node-os-utils
Monitoring_Extension --> |getCurrentLoad| nvidia-smi

Possible Changes

| Current ❌ | Upcoming ✅ |
| --- | --- |
| Implementation - depends on FS & CMD | Abstraction - API forwarding |
| Execute CMD | GET - hardware information endpoint |
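
The monitoring extension would shrink to a single forwarded request, assuming the hardware-information endpoint listed under Assumption below:

```typescript
// Sketch: forward to the assumed hardware-information endpoint instead of
// shelling out to nvidia-smi or reading node-os-utils. Port and path are assumptions.
async function getResourcesInfo(): Promise<unknown> {
  const res = await fetch('http://127.0.0.1:39281/hardware-information')
  return res.json() // CPU, RAM and GPU details as reported by cortex.cpp
}
```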

Overview

Current ❌: [architecture diagram]
Upcoming ✅: [architecture diagram]

Assumption

  • cortex.cpp bundles multiple engines (different CPU instructions and CUDA versions)
  • cortex.cpp supports the /models APIs
    • GET: /models (available, active status, compatibility prediction)
    • POST: /models/pull (& progress?)
    • DELETE: /models
  • cortex.cpp supports a /hardware-information API

Challenges of moving from Nitro to cortex.cpp

  • Different Data (Folder & File) structures
  • Backward / Forward compatibility

The migration

  • How to seamlessly move from Nitro to cortex.cpp, where:
    • cortex.cpp works with new Data Folder structure
    • cortex.cpp works with model.yaml
    • cortex.cpp works with models.list
  • How to maintain the data folder when users switch back to older versions?
    • Older versions rely on model-extension, which searches for a model.json file within the Data Folder.
    • Newer versions rely on cortex-extension, which searches for a model.yaml file within the Data Folder.

Let's think about a couple of our principles.

  1. We don't remove or manipulate user data.
  2. Rollback should always work.
  3. Minimal migration

What are some of the main concerns here?

  1. Can we use model.json and model.yaml side by side?
    1. We should be able to, since the model folder can contain anything: README.md, .gitignore, GGUF files, model.yaml, or model.json.
    2. Older versions will still function with legacy model.json files.
    3. Newer versions will work with the latest model.yaml files.
  2. How do we sync between the two?
    1. It's hard to sync them, since the different structures could break the app.
    2. We just migrate once, when no models.list is available yet; its absence is a good flag for triggering the migration.
    3. After migrating, each app version works independently with its own model file format.
  3. How about model pre-population? In other words, the Model Hub.
    1. Model pre-population is an anti-pattern. Pre-populated models don't work well with versioning and create unwanted data that confuses users. What happens when our Model Hub lists thousands of models?
    2. We implemented model import, which replaces the need for a model file. Users can just import with the HF repo ID, so there is no reason to duplicate or edit a pre-populated model.json.
    3. Model listing can be done from the extension.
    4. In short, in the next version we don't pre-populate unwanted files in the Data Folder; files are written only when users decide to download.
    5. Deleting a model means deleting the persisted model.yaml and the model files.
  4. How do other extensions work with their models? E.g., OpenAI.
    1. Remote models can be populated during the build, not persisted. registerModels now keeps model DTOs in memory instead of writing them to disk.
    2. We don't pre-populate remote models; it isn't necessary. Users are better off setting them from Extension Settings, since this is more an extension configuration than model population.
  5. Migration complexity and UX
    1. We don't convert model.json to model.yaml. Instead, we import via symlink. It's faster and avoids adding redundant new logic to Jan: a lightweight migration with less risk. Maintaining the Model ID is key; otherwise, all threads break.
    2. We don't move any files (e.g., GGUF), which would drag the migration out.
    3. What about newly or manually added GGUFs? The model symlink feature is always there for that.
    4. There are bad migration experiences from the past that we can avoid, such as:
      1. Migrating all pre-populated models
      2. Heavy file movement dragging out the duration
      3. Migrating everything at once
    5. Now we just migrate downloaded models (see the sketch after this list):
      1. Import downloaded models only as symlinks (no file movement)
      2. Don't update the ID, which would cause data inconsistency
      3. Another thought: do we really need to wait for model.yaml creation during migration?
        1. cortex.cpp can work from models.list alone to provide the available models?
        2. model.yaml generation is an asynchronous operation, so:
          1. It generates model.yaml as soon as the user tries to get or load a model.
          2. It generates model.yaml as soon as the user tries to import.
          3. Don't block the client GUI; the model list can be built from just the models.list contents, and any further operation on a given model can generate its model.yaml later.
          4. The client will prioritize the active thread's model over the others, so users' working threads are not blocked.
          5. If something goes wrong, the GGUF file will still be there and model.yaml can be generated later by other operations. model.yaml is not strictly required; it's just a cache of model file metadata.
  6. Better cache mechanism
    1. Model list and detail used to work against the file system; now they send API requests to cortex.cpp.
    2. To prevent slow loading, the client should cache accordingly on the frontend.
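
A hypothetical sketch of the one-shot migration pass described in points 2 and 5: trigger only when models.list is missing, import downloaded GGUFs as symlinks, keep the model ID, and leave model.yaml generation to cortex.cpp. The import endpoint and payload are assumptions:

```typescript
import { existsSync } from 'node:fs'
import { readdir } from 'node:fs/promises'
import { join } from 'node:path'

// Hypothetical one-shot migration pass. The import endpoint and payload are
// assumptions; the principles are the ones above: run once (models.list missing),
// symlink instead of moving files, and keep the model ID so threads stay valid.
async function migrateIfNeeded(janDataFolder: string, cortexUrl: string): Promise<void> {
  if (existsSync(join(janDataFolder, 'models.list'))) return // already migrated

  const modelsDir = join(janDataFolder, 'models')
  const entries = await readdir(modelsDir, { withFileTypes: true })

  for (const entry of entries.filter((e) => e.isDirectory())) {
    const modelId = entry.name
    const files = await readdir(join(modelsDir, modelId))
    const gguf = files.find((f) => f.endsWith('.gguf'))
    if (!gguf) continue // only downloaded models are migrated

    await fetch(`${cortexUrl}/models/import`, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        model: modelId, // unchanged ID: existing threads reference it
        modelPath: join(modelsDir, modelId, gguf), // symlink target; the file is not moved
      }),
    })
    // model.yaml generation is deferred to cortex.cpp, asynchronously, on first use
  }
}
```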

Summary

In short, the entire migration process just imports downloaded models as symlinks and records them in models.list. No model.yaml generation or folder manipulation is involved, so it should complete almost instantly.

Migration indicator: whether models.list exists.

Don't pre-populate models. Remote extensions work with their own settings instead of pre-populated models. The Cortex extension registers available-to-pull models (templates) in memory.

cortex.cpp is a daemon process that should be started alongside the app.

Jan migration from 0.5.x to 0.6.0: [diagram]


louis-jan commented Sep 19, 2024

Bundled Engines

Is it possible that cortex.cpp bundles multiple engines but exposes only one gateway?

E.g., the client requests to load a llama.cpp model, and cortex.cpp determines the compatible hardware and runs the most efficient binary.

So:

  • Clients do not need to send extra engine parameters, or only a minimal one (type).
  • Clients don't need to parse prompt templates; that's something the model should handle.
  • cortex.cpp owns the model metadata, allowing it to operate independently.
  • cortex.cpp hides the complex binary distribution, exposing a simple interface.
  • GPU on/off and GPU selection can be done via the engine /settings?

Eventually, that's all the client needs to send: the Model ID (aka model name).
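
From the client's side, the request could collapse to something like this (assuming an OpenAI-compatible chat route on the daemon; the model ID, port, and route are hypothetical):

```typescript
// Sketch: with cortex.cpp owning engine selection and prompt templates,
// the client only sends the model ID. Route, port and model ID are assumptions.
async function chat(prompt: string): Promise<string> {
  const res = await fetch('http://127.0.0.1:39281/v1/chat/completions', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      model: 'llama3-8b-instruct', // hypothetical model ID (aka model name)
      messages: [{ role: 'user', content: prompt }],
      // no engine flags, no prompt template: cortex.cpp resolves those internally
    }),
  })
  const body = await res.json()
  return body.choices[0].message.content
}
```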

Simplified model load / chat completions request: [diagram]


louis-jan commented Sep 19, 2024

Incremental Path

  1. We do what's not related to cortex.cpp first: Remote Extensions & pre-populated models.
    1. Rather than pre-populating, enhance the model configurations.
    2. registerModels now lists models available for download and does not persist model.json.
  2. Better data caching
    1. Data retrieved from extensions should be cached on the frontend for subsequent loads.
    2. Reduce direct API requests and perform more data synchronization operations.
    3. Implementing a good cache layer would prevent a bad user experience during the later migration: the app doesn't need to scan through the models folder, it can just dump cached data and import right away. It won't interrupt users' working threads, since asynchronous operations take care of data persistence (model.yaml), and model load requests already tolerate long response times.
  3. Minimal Migration Steps (cortex.cpp ready)
    1. Generate models.list from cached data; there is no need to scan the Model Folder, which can be costly.
    2. Send model import or symlink requests to generate models.list. It would be great if cortex.cpp could support batch symlinks (import), as that would only require creating a models.list file; the model.yaml files can be generated asynchronously. (This would also cover the case where a user edits models.list manually.) A sketch follows the diagram below.
    3. Update extensions to redirect requests.
    4. The worst-case scenario is users updating from significantly older versions that lack the cache improvements: go through the model folders and send import requests during the app update.
sequenceDiagram
    participant App as "App"
    participant Models as "Models"
    participant ModelList as "Model List"
    participant ModelYaml as "Model YAML"

    App->>Models: import
    activate Models
    Models->>ModelList: update models.list
    activate ModelList
    ModelList->>Models: return data
    deactivate ModelList
    Models->>ModelYaml: generate (async)
    activate ModelYaml
    Note right of ModelYaml: generate model.yaml asynchronously
    ModelYaml->>Models: (async) generated
    deactivate ModelYaml
    deactivate Models
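
A sketch of the minimal migration step referenced in 3.1 and 3.2: build the import payload from cached model entries (no folder scan) and send one batch request. A batch import endpoint is an assumption; per-model requests are the fallback:

```typescript
// Sketch only: the batch import endpoint and payload shape are assumptions.
interface CachedModel {
  id: string       // Jan model ID, kept as-is so threads remain valid
  filePath: string // absolute path to the downloaded GGUF
}

async function migrateFromCache(cached: CachedModel[], cortexUrl: string): Promise<void> {
  const models = cached.map((m) => ({ model: m.id, modelPath: m.filePath }))
  const res = await fetch(`${cortexUrl}/models/import`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ models }), // batch symlink import; only models.list is written,
  })                                  // model.yaml files are generated asynchronously later
  if (!res.ok) throw new Error(`batch import failed: ${res.status}`)
}
```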
