Common Issues and Solutions for the AI Platform
1. Can local data be used on the platform?
Yes. Upload the data through your home directory on the platform, or copy it into your home directory from the backend.
2. Can data on the platform be used locally?
Yes. Download the data through your home directory on the platform, or copy it from the backend to your own local directory.
3. How can I use my own data in the container service?
When the container service starts, the user's home directory, the job data area, and the shared data area are mounted by default. You can access the files in these three locations directly; the directory paths inside the container are the same as on the host machine.
4. How can I save the files created in the container service?
When the container service starts, the user's home directory, the job data area, and the shared data area are mounted by default. Copy the files you want to keep into one of these three locations; files stored elsewhere in the container are not kept.
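The copy step can be sketched as a small shell helper. This is only an illustration: the function name save_results and the example paths are hypothetical, and the destination should be a directory under your own home directory, job data area, or shared data area.

```shell
# Hypothetical helper: copy a directory created inside the container
# into a mounted location so that it is kept after the container stops.
save_results() {
    src="$1"    # directory created inside the container
    dest="$2"   # target under the home directory, job data area, or shared data area
    mkdir -p "$dest" && cp -r "$src" "$dest"/
}

# Example call (both paths are placeholders):
# save_results /tmp/run_output "$HOME/saved_runs"
```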
5. How can I install Python packages or RPM packages in an offline intranet environment?
Download the packages and their dependencies on a machine with Internet access, import them into the intranet, and install them directly. Alternatively, set up an offline pip or conda source inside the intranet and install packages from it.
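For Python packages, the download-and-import approach might look like the sketch below. It assumes pip is available on both machines and that both run the same operating system and Python version, since pip download fetches packages for the machine it runs on. The function names, the ./wheels directory, and requirements.txt are placeholders, not platform conventions.

```shell
# Step 1, on a machine with Internet access: download the packages listed in
# requirements.txt, together with their dependencies, into ./wheels.
fetch_packages() {
    pip download -r requirements.txt -d ./wheels
}

# Step 2, on the intranet machine, after copying ./wheels and requirements.txt
# across: install from the local directory only, without contacting an index.
install_offline() {
    pip install --no-index --find-links ./wheels -r requirements.txt
}
```

Run fetch_packages on the connected machine, transfer the directory, then run install_offline inside the intranet.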
6. How can I install dependency packages in an image that is launched as a non-root user?
Run the yum or apt-get command with sudo.
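A minimal sketch of this, assuming the non-root user has sudo rights in the image. The helper name install_pkg is hypothetical; it simply picks whichever of the two package managers is present.

```shell
# Hypothetical helper: install packages with sudo, using whichever
# package manager the image provides.
install_pkg() {
    if command -v apt-get >/dev/null 2>&1; then
        sudo apt-get update && sudo apt-get install -y "$@"
    elif command -v yum >/dev/null 2>&1; then
        sudo yum install -y "$@"
    else
        echo "neither apt-get nor yum was found" >&2
        return 1
    fi
}

# Example call (package name is a placeholder):
# install_pkg git
```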
7. The AI pages such as PyTorch and TensorFlow fail to load and do not display properly.
The appform session has expired. You can find the expiration information in portal.log and jhai.log. Just log in to the portal again to solve this problem.
8. When the compute node cannot access the external network IP of the Jingxing application portal node, the container service for model deployment cannot be used for inference in the Jupyter container.
The Jupyter container needs to be able to access the external network IP of the Jingxing application portal node. If you don't know the external network IP, contact the administrator to obtain it.
9. When executing mount in the container desktop, the error "mount: /ubuntuxwf/: mount failed: Operation not permitted." is reported.
For security reasons, the container desktop does not run with privileged permission, so mount is not allowed. If you need the files, extract the contents of what you intended to mount and copy them into the container.
10. Sometimes when TensorBoard is opened, the page is empty and the main TensorBoard page fails to load.
It is possible that the JSON file required by TensorBoard has not been fully loaded. Close the page and open it again.
11. When clicking "tensorboard" on the job details or solution details page, the corresponding TensorBoard service with the ID cannot be found in the container service.
The job and the solution share a single TensorBoard service, named tensorboard-*, which always shows the most recently opened TensorBoard. TensorBoard services that you create yourself in the container service are separate and have nothing to do with the one opened from a job or solution.
12. When starting the container service, an error is reported: "service start failed, current status is: rejected, reason is: No such image: ahaha/aaa:v_testa1".
The image name selected or entered when creating the container service does not exist, or the image name cached in the browser page refers to an image that no longer exists. Clear the browser cache, then enter or select an image that actually exists.
13. When using Firefox in the desktop container, reading the clipboard fails, so content copied on the local machine cannot be pasted.
There are security restrictions on clipboard access in Firefox. You can use the browser recommended on the portal login page.
14. In Jupyter, the kernel is restarting, or Jupyter reports that the memory is full or "CUDA error: out of memory".
Close some of the running scripts or notebooks to release host memory or GPU memory.
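To see what is using memory before closing anything, a check along the following lines may help. It is a sketch that assumes the free and nvidia-smi tools are installed in the container; each is skipped if missing. The function name show_memory is hypothetical.

```shell
# Print host and GPU memory usage; each tool is skipped if it is not installed.
show_memory() {
    if command -v free >/dev/null 2>&1; then
        free -h || true
    fi
    if command -v nvidia-smi >/dev/null 2>&1; then
        # Used and total memory per GPU.
        nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv || true
        # Processes currently holding GPU memory.
        nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv || true
    fi
    return 0
}

show_memory
```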
15. The program running in VSCode is directly killed.
This may be caused by running out of memory. Check whether the memory of the development environment is used up, then release memory, switch to a hardware specification with more memory, or adjust the program to reduce its memory usage.
16. The service takes a long time to start for the first time, or a job waits for a long time with no error reported in the log.
Since the image is large, it takes some time to pull it from the image repository when starting for the first time.
17. Why do the mounted files become part of the image when modifying the image?
During the process of modifying the image, the mounted directory will be copied into the image.
18. How can I use my own image on the platform?
Name the image according to the image adaptation rules.
19. If I run a program directly from the image repository, will the training be terminated when I close the image or the browser?
Yes. To run a job, do not open the image directly. Use the container service, job submission, or the development center instead; a job started that way keeps running after you close the terminal or the web page.
20. How can I use SSD or HDD storage?
Log in to the platform at https://hpc2login.hpc.hkust-gz.edu.cn/appform/desktop and open My Data. The path of the job data area is /hpc2ssd/JH_DATA/spooler/zhangxxxxxx, and the path of the shared data area is /hpc2hdd/JH_DATA/share/zhangxxxx. Alternatively, run df -h in the AI container. A path that starts with /hpc2ssd is SSD storage, and a path that starts with /hpc2hdd is HDD storage.
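The prefix rule can be written as a small shell function. This is an illustrative sketch; the function name storage_type is hypothetical, and it only inspects the path string rather than querying the file system.

```shell
# Classify a path as SSD or HDD storage from its prefix.
storage_type() {
    case "$1" in
        /hpc2ssd|/hpc2ssd/*) echo "ssd" ;;
        /hpc2hdd|/hpc2hdd/*) echo "hdd" ;;
        *) echo "unknown" ;;
    esac
}

# Example: report the storage type of the current directory.
storage_type "$PWD"
```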
21. I can't install software in the container.
Switch to the root user with the sudo -i command, then install the software.
22. How can I activate the anaconda3 environment in the script?
Add the following line to the script, replacing <environment name> with the name of your environment: source /hpc2ssd/softwares/anaconda3/bin/activate <environment name>
23. I can't see the SSH address and password in the AI development center container.
If you imported the image yourself rather than using an officially provided one, you need to manually modify the container configuration and save it as a new image before the SSH address and password are displayed. For details, see https://docs.hpc.hkust-gz.edu.cn/hpc/howtoconnect/image-ssh
24. A new container created in the development center takes a long time to start.
If the container image is large, you need to wait for a while.
25. In the Slurm cluster, the same script sometimes takes much longer to run than usual.
Add a line to the script to exclude the suspicious compute nodes, for example: #SBATCH -x cpu1-17,cpu1-108
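As a sketch of where the directive goes, the snippet below writes a minimal batch script to a file. Only the -x line (the short form of sbatch's --exclude option) comes from the answer above; the file name job.sh, the job name, and the srun hostname command are placeholders.

```shell
# Write a minimal batch script; everything except the -x line is a placeholder.
cat > job.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=demo
#SBATCH -x cpu1-17,cpu1-108
srun hostname
EOF

# Submit it on the cluster with: sbatch job.sh
```

The #SBATCH lines must come before the first executable command in the script, otherwise Slurm ignores them.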