User Manual

After successfully logging in, the page is divided into three sections: the system function menu on the left, the application template list at the top, and the public menu bar at the top right, as shown in the image below:

step2

Resource Viewing

1) After logging into the cluster, click "My Resources" at the top right of the page.

step3

2) View information about the remaining platform resources.

step4

Data Management

After logging into the platform, select "Data Management" on the left side to open the data management page. Right-click on the page to upload and download files and directories:

1) Upload files and directories

step5

2) Download files and directories

step6

Application Download

1) If the desktop does not show an icon for the required software, download the software template from the Application Center: open the Application Center, then click the download button next to the software template.

step7

Training Task Submission Method Based on Web + Slurm

Taking the PyTorch software as an example, the steps for submitting a web-based task are as follows:

1) Use the web data management interface to upload the PyTorch case files to your home directory on the cluster (please ensure they are in a folder).

step8

2) Select the corresponding PyTorch icon to enter:

step9

3) After clicking on the PyTorch icon, you will be in the "List" tab by default. Click the "Submit" icon below.

step10

4) Fill in the parameters:

  • Job Name: Enter the current job name, which must not contain Chinese characters, spaces, or special characters;
  • Queue: Select the acd queue;
  • CPU Tasks - Number of Cores: Number of cores required for program execution;
  • GPU Tasks - Number of Nodes: Number of hosts required for program execution;
  • GPU Tasks - Number of Cards per Node: Select the number of GPU cards needed per node;
  • Python Execution File: Select the Python file for the PyTorch program execution;
  • Custom Parameters: Enter any additional Python parameters if needed; otherwise, leave blank.

5) After filling in the parameters, click "Submit" to complete the submission.

step11

Fields marked with a red * are mandatory.

6) After successful submission, a notification indicating successful job submission will appear at the top of the web page. You can click the link to enter the list interface for job management, or use "Job Management" on the left to view the job status:

step1

7) Click on the name of the corresponding running job to view details, CPU usage, and summary information. Click " " to perform operations such as deleting, modifying, or stopping the job.

step2

8) Once the computation is complete, you can enter data management through the job list link or find the corresponding data for download and post-processing analysis via data management.

step3 step4

Training Task Submission Method Based on CMD + Slurm

Taking the PyTorch software as an example, the steps for submitting a task from the command line are as follows:

1) Open the Shell application template

step5

2) Upload the case file agpu.py into the PyTorch folder

step6

3) The sinfo command queries the idle status and resource information of the nodes in each partition of the cluster. The squeue command shows the status of submitted jobs in the queue.

step7

4) Use sbatch to submit tasks and squeue to check task status.

step8

5) Use scancel jobid to cancel a submitted task.

step9
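The commands above are usually driven by a batch script. Below is a minimal sketch of a Slurm job script for the agpu.py case file; the partition name, resource counts, and Python invocation are assumptions that you should adapt to your cluster.

```shell
#!/bin/bash
#SBATCH --job-name=pytorch_demo    # job name: no Chinese characters, spaces, or special characters
#SBATCH --partition=acd            # queue name (assumption; match your cluster's queue)
#SBATCH --nodes=1                  # number of nodes
#SBATCH --gres=gpu:1               # GPU cards per node
#SBATCH --output=%j.out            # write stdout/stderr to <jobid>.out

python agpu.py                     # the uploaded case file
```

Save the script as, say, job.slurm, submit it with `sbatch job.slurm`, monitor it with `squeue`, and cancel it with `scancel <jobid>` as described above.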

AI Development Process

After successfully logging in, click the "AI Studio" function menu to enter the user portal platform, as shown in the image below:

step10 step11

1 Operating Process

Step 1: Prepare data. Prepare the dataset, code, and image related to object detection.

Step 2: Code development (optional). Write or use your own training and inference code, debug it to ensure it runs correctly in the development environment, integrate platform functions, and publish to the model library. If using a pre-set model from the platform, you can skip this step.

Step 3: Model training. Create a model training task and publish the optimized model to the model library.

Step 4: Application deployment. Deploy the model as an online inference service.

Step 5: Service invocation. Use the API for service invocation and return inference results.

2 Data Preparation

First, prepare the dataset for model training and evaluation, the algorithm code for the object detection model, and the image supporting model operation.

2.1 Prepare the Dataset

In this case, the COCO dataset is used. You can download the COCO dataset and upload it to the platform:

  • 2017 Train images [118K/18GB]
  • 2017 Val images [5K/1GB]
  • 2017 Test images [41K/6GB]
  • 2017 Train/Val annotations [241MB]

After downloading the official training set and validation set, place their respective images and annotation files together, create a zip or tar archive, and then upload the archive to the platform.
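For example, the training-set archive might be assembled like this; the directory and file names are assumptions based on the official COCO downloads.

```shell
# Gather the images and their annotation file into one folder, then archive it.
mkdir -p coco2017_train/images
# Copy the extracted train2017 images into coco2017_train/images and place the
# matching annotation JSON alongside them, e.g.:
#   cp -r train2017/* coco2017_train/images/
#   cp annotations/instances_train2017.json coco2017_train/
tar -czf coco2017_train.tar.gz coco2017_train
```

The resulting coco2017_train.tar.gz is what you upload to the platform; repeat the same packaging for the validation and test sets.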

Upload Dataset

In the left navigation bar, go to "Data Services > Dataset Management," and click the "Create Dataset" button on the dataset repository page.

step12

Enter the dataset creation page and fill in the following information.

step13

Click the "Create" button, and the page will redirect to the dataset repository page. Click the name of the dataset you just created to enter the details page. You will see that a version 0.1 is automatically created with the status "Creating." Wait for the status to change to "Success," indicating that the dataset upload is complete.

step14

Note that training, validation, and test sets need to be created separately.

2.2 Prepare Algorithm Code

In this case, the SSD algorithm is used for object detection. SSD (Single Shot MultiBox Detector) is an object detection algorithm proposed by Wei Liu et al. at ECCV 2016 and is one of the main detection frameworks. The model can be downloaded locally via cloud disk.

View Pre-set Algorithm Models

In the left navigation bar, go to "Model Development and Training > Model Management > Model Library," search for the pre-set model name, find the corresponding pre-set model, and click the model name to enter the model details.

step15

In the model details page, click the corresponding version number to enter version details.

step16

In the version details, you can see the basic information, training, evaluation, compression, and inference configuration of the algorithm model, and you can edit them.

step17

2.3 Prepare Image

The platform runs algorithm models in an image. An image is a packaged file containing all the necessary software, libraries, dependencies, and configurations, providing a portable and consistent execution environment for the model, ensuring the model runs consistently across different systems or platforms. The platform has pre-set the relevant image for the SSD algorithm in this case, which can be directly used.

| Purpose | Compute Resource Type | Pre-set Image Name | Pre-set Image Version |
| ------- | --------------------- | ------------------ | --------------------- |
| Code development environment, model training tasks | NVIDIA GPU | ms2.2_cuda11.6_gpu | atc_0.1 |

3 Code Development

3.1 Launch Development Environment

In the left navigation bar, go to "Model Development and Training > Code Development," and click the "New Project" button on the list page.

step18

In the new project popup, fill in the following information.

step19

Click the "Submit" button, and the page will redirect to the development project list page. Click the name of the project you just created to enter the details page, and click the "New Development Environment" button.

step20

In the new environment page, fill in the following information. Select the prepared image, dataset, and choose GPU for resources.

step21

Click the "Launch Environment" button. You will see the environment status as "Scheduling." Wait until the status changes to "Running," at which point the development environment is ready for use.

step22

The page displays the paths for each directory, including code, datasets, and outputs, which you can copy and use. Please adhere to the usage guidelines for each directory.

3.2 Debugging Algorithm Code

After the development environment is launched, open the web version of VSCode in the development toolbar. Prepare the SSD algorithm code by dragging the files from its code folder to the code directory on the left side of VSCode, and drag the files from its model folder to the outputs directory on the left side of VSCode. Once the upload is complete, you can start debugging the code.

step23

Open the Terminal from the menu and run the code for debugging.

step24

Enter the following command line to debug the training code of the algorithm, ensuring it can be trained normally in the code development environment.

```bash
bash train.sh --data_path /home/aistudio-user/adhub/coco2017_train/0.1 --output_path /home/aistudio-user/outputs --epoch_size 1 --batch_size 32 --lr 0.001 --mode not --run_platform GPU --save_checkpoint_epochs 10 --dataset custom
```

Enter the following command line to debug the inference code of the algorithm, ensuring it can infer normally in the code development environment.

```bash
bash start.sh --model-path /home/aistudio-user/outputs --soc_version Ascend310 --quant false
```

If you want the inference code to provide an online service, you need to write a corresponding inference service framework. Templates and examples are provided in the sample directory of the development environment; see the README document for details.

step24

In this example of the SSD algorithm code, the inference service framework has already been written and can be used directly without debugging.

3.3 Integrating Platform Functions

If you want the algorithm model to perform model training, evaluation, and inference on the platform, you need to integrate platform parameters into the code. This section only covers integration for model training and inference; other configurations are similar.

(1) Integrating Platform Model Training

When launching a model training task, the platform dynamically generates the training dataset path, training task output path, and incremental training weight path (red box ① in the image below) and passes them as command line parameters (red box ② in the image below) to the training start script of the algorithm code. The algorithm code needs to handle the corresponding content.

step

In the SSD algorithm code of this example, open train.py, and the get_args function processes the corresponding content, as shown in the image below.

step
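As a rough sketch, a get_args-style parser that receives the platform-injected paths could look like the following. The flag names mirror the train.sh invocation shown earlier and are assumptions for illustration, not the actual SSD source code.

```python
import argparse

def get_args(argv=None):
    """Parse the command-line parameters the platform passes at launch.

    The flag names (--data_path, --output_path, ...) mirror the earlier
    train.sh example; match them to the names your code actually receives.
    """
    parser = argparse.ArgumentParser(description="SSD training entry point")
    parser.add_argument("--data_path", required=True,
                        help="training dataset path injected by the platform")
    parser.add_argument("--output_path", required=True,
                        help="training output path injected by the platform")
    parser.add_argument("--epoch_size", type=int, default=1)
    parser.add_argument("--batch_size", type=int, default=32)
    parser.add_argument("--lr", type=float, default=0.001)
    return parser.parse_args(argv)

# The platform effectively invokes the script with these flags on the
# command line; here they are passed explicitly for demonstration.
args = get_args(["--data_path", "/tmp/data", "--output_path", "/tmp/out"])
```

The key point is that the parameter names declared here must match what the platform generates when it launches the training task, as the red boxes in the image indicate.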

(2) Integrating Platform Application Deployment

When deploying online services, the platform dynamically generates the inference model path (red box ① in the image below) and passes it as a command line parameter (red box ② in the image below) to the inference start script of the algorithm code. The algorithm code needs to handle the corresponding content.

step

In the SSD algorithm code of this example, open train.py, and the get_args function processes the corresponding content, as shown in the image below.

step

3.4 Publishing to the Model Library

Once the algorithm code has been debugged, it can be published to the model library for training, evaluation, and inference. Note that if the code development environment has installed new software packages that are dependencies for running the model, please save the image before publishing the model; otherwise, the model may not run properly.

In the development environment, click the "Publish Model" button to proceed with publishing.

step

In the model publishing page, configure the basic information, training configuration, and inference configuration for the model in this case.

  • Basic Information: Select the content and tags to be published. step
  • Training Configuration: Select the startup script train.sh for the launch command. The "Command-line parameter names received by the code" must match the names the code actually receives, or an error will occur. Enter hyperparameters that the code can handle and support. After entering, you can preview the launch command. step
  • Inference Configuration: Select the startup script start.sh for the launch command. The "Command-line parameter names received by the code" must match the names the code actually receives. Enter inference parameters that the code can handle and support. After entering, you can preview the launch command. step

After configuration, click the "Submit" button. The page will redirect to the version details page of the model library, where you can see a new version has been automatically created. The status will be "Importing." Wait until the status changes to "Import Successful" to complete the model publishing. step

4 Model Training

The algorithm model debugged in the development environment still needs to undergo large-scale training to optimize its parameters and enable accurate prediction and identification of new data.

4.1 Creating a Training Task

Navigate to "Model Development and Training > Model Training" through the left navigation bar, and click the "New Project" button on the list page.

step

In the new project popup, fill in the following information.

step

Click the "Submit" button. The page will redirect to the training project list page. Click the name of the project you just created to enter the details page, and click the "New Task" button.

step

In the new model training task page, fill in the following information: select the model you just published and the prepared training dataset, adjust hyperparameters as needed, and choose GPU for training resources. Note that memory must be sized to the dataset, otherwise the training process may report an error; 16 GB is recommended here.

step

Click the "Submit" button. The page will redirect to the training task list page. Click the task number you just created to enter the details page, where you can see the task status as "Running." Click "Log Information" for auto-refresh.

step

When the task status changes to "Completed," the training is finished. In the task list, click "Browse Files" to view the task output files. You will see that the training produced the ssd.mindir file.

step

4.2 Publishing the Trained Model to the Model Library

After model training or evaluation, the training output ssd.mindir can be published as a new version.

In the project details page, for tasks that have been completed, click the "More - Publish" button to enter the publishing page.

step

On the publishing page, fill in the following parameters.

step

After filling in, click the publish button, and the model will be published to the model library according to the above information. In the training task list, you can see the status of this publication.

5 Application Deployment

Trained models can be deployed as online services. Navigate to "Application Deployment > Central Deployment > Online Service" through the left navigation bar, and then click the "Create Inference Service" button.

step

In the creation page, fill in the following parameters and select the model version you just published.

step

Click "Submit," and the page will redirect to the online service list page. In the list, you can see the online service you just deployed. Click the name of this service to view details.

step

Wait until the task status changes to "Running"; the model is then deployed successfully and provides online inference services. On the details page, go to the prediction tab and upload an image to check the inference result.

step

If the inference result is displayed correctly, it indicates that the online service is running normally.

step

6 Service Invocation

Once the online service is running normally, click the invocation guide tab to copy the access URL.

step

In this example, using the HTTP method POST, the URL parameters are as follows:

| Parameter | Value |
| --------- | ----- |
| data | Base64 encoding of the image |
| name | Image name |

In this example, the header is as follows:

| Parameter | Value |
| --------- | ----- |
| Content-Type | application/json |

Body request parameters:

```json
{
    "requests": [
        {
            "data": "/9j/4R9ERXhpZgAA…………KVGxphQX/2Q==",
            "name": "image01.jpg"
        }
    ]
}
```

The data field is the Base64 encoding of the image.
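As a sketch of the call, the request above can be sent with Python's standard library. The helper names are illustrative, and the URL is a placeholder for the access URL copied from the invocation guide tab.

```python
import base64
import json
import urllib.request

def build_payload(image_bytes: bytes, name: str) -> bytes:
    """Encode an image as the JSON body shown in the request example."""
    data = base64.b64encode(image_bytes).decode("ascii")
    return json.dumps({"requests": [{"data": data, "name": name}]}).encode("utf-8")

def call_service(url: str, payload: bytes) -> dict:
    """POST the payload with a JSON Content-Type header and parse the reply."""
    req = urllib.request.Request(
        url, data=payload,
        headers={"Content-Type": "application/json"}, method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# Usage (replace the URL with the one copied from the invocation guide tab):
#   with open("image01.jpg", "rb") as f:
#       body = build_payload(f.read(), "image01.jpg")
#   result = call_service("http://your-access-url", body)
#   print(result["code"], result["data"][0]["predict"])
```

A successful call returns a JSON document with code 0 and a data list whose predict field carries the detection output, as in the response example below.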

Response example:
```json
{
    "code": 0,
    "msg": "",
    "data": [
        {
            "serviceName": "edgeinfer-1721201546190962310",
            "modelId": "95",
            "modelVersionId": "196",
            "result": "OK",
            "predict": {
                "meta": [],
                "shapes": [],
                "tags": []
            },
            "inferenceTime": {
                "infer": 0,
                "postprocess": 0,
                "preprocess": 0
            },
            "latency": 0,
            "requestTime": 0,
            "responseTime": 0
        }
    ]
}
```

In the inference results tab, you can view the results of the object detection inference, as well as filter the inference results and create datasets.

step