Model Training

Function Overview

The system supports the creation of training tasks, including:

  • Selecting a model from the model factory
  • Setting hyperparameters
  • Selecting a training dataset
  • Providing resource specifications to run the training task

When starting a model training task, the system supports distributed training.

During the training process, users can:

  • View task details, including the training task ID, training dataset, training logs, and other information
  • Start a visualization task to display training visualization information through TensorBoard or MindInsight, and retrieve the task's operation logs
  • Utilize the experimental result comparison capabilities provided by the system

Create a Project

Before starting training and evaluation, a model training project needs to be created.

Train the Model

Create a Training Task

On the "Create Model Training" page, select a model from the model library and fill in the following parameters:

  • Model Name: Click the browse button and select the required model from the pop-up model list. After selection, the model's relevant information is filled in automatically, including tags, image, CPU architecture, conda environment, training visualization, and distributed training

  • Incremental Training: If the model supports incremental training, you can select a weight file output by an existing task as the pre-trained model for incremental training

  • Hyperparameter Configuration: You can use the default values for training or modify the parameter configuration as needed

  • Data Configuration: Select one or more datasets. If multiple datasets are selected, the algorithm code must handle them itself, as sketched below
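
How the selected dataset paths are delivered to the algorithm is platform-specific; a common pattern is to pass them as a command-line parameter. The following is a minimal sketch assuming a hypothetical --data_path parameter that may contain several comma-separated directories; the actual parameter name and format depend on the model's configuration.

import argparse
import os

parser = argparse.ArgumentParser()
# Hypothetical parameter name; use the name configured for the model in the model library.
parser.add_argument('--data_path', type=str, required=True,
                    help='one or more dataset directories, comma-separated (assumed format)')
args = parser.parse_args()

dataset_dirs = [p for p in args.data_path.split(',') if p]
for dataset_dir in dataset_dirs:
    # The algorithm code decides how to combine multiple datasets, e.g. concatenate or iterate in turn.
    print(dataset_dir, len(os.listdir(dataset_dir)))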

View Task Details

In the task list, click the task number to view the training details, including the training log. Wait for the training status to change to "Completed". After training finishes, you can view and download the model files, including the weight file.

Resume Training from Checkpoint

For training tasks that were interrupted partway, you can resume training from the latest checkpoint record. In the training task list, click the "More - Resume Training" button on the right side of the task, and training will resume automatically from the latest checkpoint.

It should be noted that not all models support resuming training from a checkpoint. The following conditions need to be met (a minimal code sketch follows the list):

  • In the model code, write the checkpoint data to the "Task Output Path" received from the platform parameters

  • Read the checkpoint data from this path when training resumes

  • Turn on the "Resume Training from Checkpoint" switch in the model's training configuration
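
The following is a minimal sketch of this contract, assuming a hypothetical --output_path command-line parameter for the "Task Output Path" and PyTorch-style checkpoints; the actual parameter name and checkpoint format depend on your model.

import argparse
import os
import torch
import torch.nn as nn

parser = argparse.ArgumentParser()
# Hypothetical parameter name carrying the platform's "Task Output Path".
parser.add_argument('--output_path', type=str, required=True)
args = parser.parse_args()
os.makedirs(args.output_path, exist_ok=True)

model = nn.Linear(10, 1)                                    # toy model for illustration
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
ckpt_file = os.path.join(args.output_path, 'checkpoint.pt')

# Condition 2: when resuming, read the checkpoint data found in the task output path.
start_epoch = 0
if os.path.exists(ckpt_file):
    state = torch.load(ckpt_file)
    model.load_state_dict(state['model'])
    optimizer.load_state_dict(state['optimizer'])
    start_epoch = state['epoch'] + 1

for epoch in range(start_epoch, 10):
    x, y = torch.randn(32, 10), torch.randn(32, 1)          # dummy batch
    loss = nn.functional.mse_loss(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # Condition 1: write the checkpoint data to the task output path after each epoch.
    torch.save({'model': model.state_dict(),
                'optimizer': optimizer.state_dict(),
                'epoch': epoch}, ckpt_file)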

Evaluate the Model

For tasks that have completed model training, the model can be further evaluated.

Create an Evaluation Task

In the training list, click the "Evaluate" button on the right side of the training task to evaluate the model. Enter the "Create Model Evaluation" page and fill in the relevant parameters.

View Evaluation Task Details

Click the evaluation number to view the evaluation details, evaluation indicators, evaluation logs, and other information.

Training Visualization

Two ways of visualizing training tasks are supported: TensorBoard and MindInsight.

The following requirements need to be met for visualization (a minimal code sketch of the first two points follows this list):

  • The model algorithm needs to write the visualization content to the training task output path, following the visualization specifications of TensorBoard or MindInsight.

  • The model algorithm needs to receive the training task output path, which the platform passes as a command-line parameter. The name of this command-line parameter must be set in the training configuration of the model details in the model library.

  • While the training task is running, click the start button in the "TensorBoard/MindInsight" column of the task list. After the status changes to running, click the open button to view the TensorBoard or MindInsight visualization content in a new window.
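
The following is a minimal sketch of the first two requirements using TensorBoard, assuming a hypothetical --output_path parameter name; a MindInsight-based model would write MindInsight summary data to the same path instead.

import argparse
from torch.utils.tensorboard import SummaryWriter

parser = argparse.ArgumentParser()
# Hypothetical parameter name; it must match the name set in the model's training configuration.
parser.add_argument('--output_path', type=str, required=True)
args = parser.parse_args()

# Write TensorBoard event files into the training task output path so the platform can serve them.
writer = SummaryWriter(log_dir=args.output_path)
for step in range(100):
    loss = 1.0 / (step + 1)                      # placeholder metric for illustration
    writer.add_scalar('train/loss', loss, step)
writer.close()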

Experimental Comparison

Experimental comparisons can be made throughout the process, including training comparisons after model training, evaluation comparisons after model evaluation, and debugging training comparisons after code development. The system compares the output indicators of multiple training or evaluation sessions and displays the comparison results visually.

Monitor Comparison Indicators

Before making a comparison, you need to specify the comparison indicators in the model's algorithm code and track their output. Use the experimental data visualization SDK provided by the platform to record the training and evaluation indicators to be compared.

Install the SDK

After entering the development environment, first install aim from the command line, using either the JupyterLab terminal or SSH.
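
For example, assuming the development environment can reach a Python package index, aim can typically be installed with pip install aim.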

Initialize

Write the experimental script.

Import the aim package and create a Run object. A Run object usually corresponds to a model training or evaluation experiment.

from aim import Run

run = Run()
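
Optionally, aim's Run constructor also accepts arguments such as experiment, which groups related runs under a common name; the name below is purely illustrative.

run = Run(experiment='image-classification-baseline')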

Declare Hyperparameters

Record the experimental parameters for easy subsequent observation and analysis.

run['hparams'] = {
    'learning_rate': 0.001,
    'batch_size': 32,
}

Record Indicators

At the appropriate place in the training or evaluation code, call track to record the indicators. value is the indicator value, and name is the indicator name.

run.track(value, name='name')
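
track also accepts optional step, epoch, and context arguments (standard aim options), which help distinguish, for example, training and validation curves; the variable names below are illustrative.

run.track(loss_value, name='loss', step=step, context={'subset': 'train'})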

Complete Example

from aim import Run

run = Run()

run['hparams'] = {
    'learning_rate': 0.001,
    'batch_size': 32,
}

...  # Model training or evaluation code

run.track(value1, name='name1')

...  # Model training or evaluation code

run.track(value2, name='name2')
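
For reference, the following is a minimal self-contained variant of the example above; the loop and metric values are toy placeholders rather than platform-specific code.

from aim import Run

run = Run()
run['hparams'] = {'learning_rate': 0.001, 'batch_size': 32}

for epoch in range(5):
    train_loss = 1.0 / (epoch + 1)        # placeholder for the real training loss
    val_accuracy = 0.5 + 0.1 * epoch      # placeholder for the real validation accuracy
    run.track(train_loss, name='loss', epoch=epoch, context={'subset': 'train'})
    run.track(val_accuracy, name='accuracy', epoch=epoch, context={'subset': 'val'})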

Training Comparison

Open a model training project and complete several training sessions.

  • Click the "Training Comparison" button on the project details page, and a pop-up window for selecting training tasks will appear.

  • Select the training tasks to be compared in the pop-up window, add them to the target list on the right side of the shuttle box, and click the "Compare" button.

  • On the training comparison page, you can view the comparison of indicators.

  • On the training comparison page, click "Visual Comparison" to view the visual comparison of indicators.

Evaluation Comparison

Open a model training project and complete several evaluation sessions.

  • Click the "Evaluation Comparison" button on the project details page, and a pop-up window for selecting evaluation tasks will appear.

  • Select the evaluation tasks to be compared in the pop-up window, add them to the target list on the right side of the shuttle box, and click the "Compare" button.

  • On the evaluation comparison page, you can view the comparison of indicators.

  • On the evaluation comparison page, click "Visual Comparison" to view the visual comparison of indicators.

Publish the Model

After model training or evaluation, the trained weights can be published as a new model or as a new version of an existing model.

On the project details page, for a task that has been completed, click the "More - Publish" button to enter the publishing page.

On the publishing page, fill in the following parameters:

  • Model Name: If the entered name is the same as an existing model, it will be published as a new version of the existing model; otherwise, it will be published as a new model.

  • Description: Fill in the description information of the model.

  • Tags: You can create new tags or select existing ones to facilitate the description and filtering of models.

  • Published Content: Select files from the training task output directory. They will be copied to the model directory of the new model package during publishing.

After filling in the information, click the publish button, and the model will be published to the model library according to the above information. In the training task list, you can see the status of this publication.

Configuration Update After Publishing

The training, evaluation, and inference configurations of the newly published model are the same as those of the model used for training. If adjustments are needed, go to the model details in the model library to update them.