High Latency on Custom YOLOv8 Model Deployment (Tokyo Region)
Hello community,
I'm currently deploying a custom YOLOv8 (small) object detection model using the JapanCV Cloud model hosting service. The model container was successfully uploaded and is running on the recommended instance type: jp-standard-gpu-m.
We are making direct API calls to the /v1/inference/tokyo endpoint, but we consistently observe latencies of 800–950 ms per inference on standard 640x640 images. The latency persists even when the client runs from a low-latency environment inside the same Tokyo-region VPC, so network round-trip time should not account for it.
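For reference, here is a minimal sketch of how we measure the end-to-end latency from inside the VPC. The hostname, auth header, and raw-bytes payload are simplified placeholders for this post (our real client is more involved); the timing logic is the relevant part:

```python
import statistics
import time

import requests

ENDPOINT = "https://<our-host>/v1/inference/tokyo"   # placeholder host
HEADERS = {"Authorization": "Bearer <API_KEY>"}      # placeholder auth

def time_inference(image_path: str, runs: int = 20) -> None:
    """POST the same 640x640 image repeatedly and report latency stats."""
    with open(image_path, "rb") as f:
        payload = f.read()

    latencies_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        resp = requests.post(ENDPOINT, headers=HEADERS, data=payload, timeout=5)
        resp.raise_for_status()
        latencies_ms.append((time.perf_counter() - start) * 1000.0)

    print(f"min={min(latencies_ms):.0f} ms  "
          f"median={statistics.median(latencies_ms):.0f} ms  "
          f"max={max(latencies_ms):.0f} ms")

time_inference("sample_640x640.jpg")
```

The median sits in the 800–950 ms range described above on every run.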
I have already profiled the model's pre-processing step in isolation and verified that it is efficient; see the timing sketch below.
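Concretely, the check looked roughly like the following (our real pipeline uses a letterbox resize, but this approximation captures the cost); per-image pre-processing time comes nowhere near the 800 ms we are seeing:

```python
import time

import cv2
import numpy as np

def preprocess(image_path: str) -> np.ndarray:
    """Typical YOLOv8-style pre-processing: resize to 640x640,
    BGR->RGB, scale to [0, 1], reorder to NCHW."""
    img = cv2.imread(image_path)
    img = cv2.resize(img, (640, 640))
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB).astype(np.float32) / 255.0
    return np.transpose(img, (2, 0, 1))[np.newaxis, ...]  # (1, 3, 640, 640)

runs = 100
start = time.perf_counter()
for _ in range(runs):
    preprocess("sample_640x640.jpg")
elapsed_ms = (time.perf_counter() - start) * 1000.0 / runs
print(f"pre-processing: {elapsed_ms:.1f} ms per image")
```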
My precise question is: is there a specific instance_scaling_policy setting, or another advanced parameter in the deployment manifest (YAML), that prioritizes low latency over throughput, for example by dedicating a larger slice of the GPU to each replica instead of sharing it across requests? We need to achieve sub-500 ms response times consistently before this can go to production.
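To make the question concrete, the stanza below is the kind of thing I am imagining. Every key under instance_scaling_policy here is a guess on my part; I have not found any of them in the documentation, so please treat this purely as an illustration of the intent:

```yaml
# Hypothetical sketch: none of these keys are confirmed to exist.
deployment:
  model: yolov8s-custom
  instance_type: jp-standard-gpu-m
  region: tokyo
  instance_scaling_policy:
    mode: latency_optimized   # as opposed to a throughput-optimized default?
    min_instances: 2          # keep warm replicas to rule out cold starts
    max_batch_size: 1         # disable any server-side batching delay?
    gpu_slice: dedicated      # reserve a full GPU slice per replica?
```

If the real knob has a different name or lives somewhere other than the manifest, a pointer to the right documentation page would be just as helpful.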
Thank you for any insight you can provide!
