The Challenge
A government-funded healthcare AI research initiative in Korea needed a platform for training medical professionals on AI/ML using specialized medical datasets. Commercial cloud solutions like Google Colab couldn't provide unlimited GPU access, couldn't host proprietary medical data securely, and lacked the isolated multi-user environment required for structured training programs.
This was part of a national research project focused on advancing healthcare AI capabilities. The platform needed to support hands-on training sessions where 40 students could simultaneously work with GPU-intensive deep learning models on prepared medical datasets, all while maintaining data isolation and security.
Key Constraints
- Support exactly 40 concurrent students (limited by 4x Tesla V100 GPUs)
- Provide isolated workspaces with shared read-only datasets and private user folders
- Enable unlimited GPU usage during training (unlike commercial cloud limits)
- Secure handling of proprietary medical training datasets
- Real-time monitoring of GPU utilization across MIG partitions
- Cost-effective alternative to commercial cloud services
Our Approach
Built a self-hosted Jupyter platform with dual-service architecture: NestJS GraphQL API for user management and session control, and FastAPI service for Docker container lifecycle and GPU allocation. Used NVIDIA MIG (Multi-Instance GPU) to partition 4 Tesla V100 GPUs into isolated instances for fair resource distribution.
Key Technical Decisions
- NestJS GraphQL + FastAPI dual architecture - separation of concerns between user management and container orchestration
- NVIDIA MIG for GPU partitioning - fair resource allocation across 40 concurrent users from 4 physical GPUs
- Docker for environment isolation - each student gets isolated Jupyter container with mounted shared/private volumes
- Custom Prometheus GPU exporters - built custom monitoring solution since standard tools don't support MIG partitioning
- MongoDB for flexibility - rapid iteration on user/session schemas during government project development (see the session-document sketch after this list)
- Socket.io for real-time updates - instant container status and resource usage feedback to students and instructors
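To illustrate the schema-flexibility point above, here is a minimal sketch of how a training session might be stored with pymongo. The collection name, field names, and helper function are hypothetical, not the project's actual schema.

```python
# Hypothetical sketch of a flexible session document in MongoDB (pymongo).
# Collection and field names are illustrative, not the project's actual schema.
from datetime import datetime, timezone

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
sessions = client["training_platform"]["sessions"]

def record_session(user_id: str, container_id: str, mig_uuid: str) -> None:
    """Insert a session document; new fields can be added later without migrations."""
    sessions.insert_one({
        "userId": user_id,
        "containerId": container_id,
        "migDevice": mig_uuid,           # MIG partition assigned to this student
        "status": "starting",
        "createdAt": datetime.now(timezone.utc),
        # Fields such as a dataset version or instructor notes can be added
        # mid-project without a schema migration -- the flexibility noted above.
    })

record_session("student-07", "jupyter-student-07", "MIG-xxxxxxxx")
```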
Timeline: 6 months from initial planning to production deployment with full monitoring stack
Implementation
Architecture Design & Infrastructure Setup (4-6 weeks)
Designed dual-service architecture, set up GPU servers with NVIDIA MIG, configured Docker networking, and established development environment. Planned resource allocation strategy for 40 concurrent users across 4 Tesla V100 GPUs.
Core Platform Development (8-10 weeks)
Built NestJS GraphQL API with role-based authentication, user management, and session control. Developed FastAPI service with Docker SDK for container lifecycle management and GPU allocation via NVIDIA runtime.
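A minimal sketch of what the FastAPI surface for session control could look like. The route paths, request model, and ContainerManager helper are assumptions for illustration, not the project's actual API.

```python
# Minimal FastAPI sketch for session control. Route paths, the request model,
# and the ContainerManager helper are illustrative assumptions.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class SessionRequest(BaseModel):
    user_id: str

class ContainerManager:
    """Hypothetical wrapper around the Docker SDK (see the orchestration sketch below)."""

    def start(self, user_id: str) -> dict:
        # Create the Jupyter container, assign a MIG partition, mount volumes.
        return {"userId": user_id, "status": "starting"}

    def stop(self, user_id: str) -> None:
        # Stop and remove the user's container, release its MIG partition.
        pass

manager = ContainerManager()

@app.post("/sessions")
def create_session(req: SessionRequest) -> dict:
    """Start an isolated Jupyter container for one student."""
    return manager.start(req.user_id)

@app.delete("/sessions/{user_id}")
def delete_session(user_id: str) -> dict:
    """Tear down the student's container and free its GPU partition."""
    manager.stop(user_id)
    return {"status": "stopped"}
```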
Container Orchestration System (6-8 weeks)
Most critical phase - implemented sophisticated container management system with volume mounting (shared read-only medical datasets + private user folders), GPU resource assignment via MIG, and automatic cleanup. Required significant debugging and optimization.
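A condensed sketch of the container-creation step using the Docker SDK for Python. The image name, host paths, port, and memory limit are assumptions; the MIG instance is handed to the NVIDIA container runtime by setting NVIDIA_VISIBLE_DEVICES to the partition's UUID.

```python
# Sketch of per-student container creation with docker-py. Image name, host paths,
# and ports are illustrative; the MIG instance is selected by setting
# NVIDIA_VISIBLE_DEVICES to the partition's UUID for the NVIDIA runtime.
import docker

client = docker.from_env()

def start_student_container(user_id: str, mig_uuid: str, host_port: int):
    return client.containers.run(
        "jupyter/tensorflow-notebook:latest",    # assumed base image
        detach=True,
        name=f"jupyter-{user_id}",
        runtime="nvidia",                         # NVIDIA container runtime
        environment={"NVIDIA_VISIBLE_DEVICES": mig_uuid},
        volumes={
            "/data/medical-datasets": {"bind": "/datasets", "mode": "ro"},           # shared, read-only
            f"/data/users/{user_id}": {"bind": "/home/jovyan/work", "mode": "rw"},   # private workspace
        },
        ports={"8888/tcp": host_port},            # Jupyter exposed per student
        mem_limit="16g",                          # keep one student from exhausting host RAM
    )
```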
Custom Monitoring Solution (3-4 weeks)
Built custom Prometheus exporters for MIG-aware GPU monitoring (most standard tools don't support MIG). Integrated Grafana dashboards for real-time GPU utilization, memory usage, and per-user resource tracking. Essential for managing concurrent workloads.
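The project's exporter parsed nvidia-smi output; as a comparable sketch, the example below uses the NVML Python bindings (pynvml) with prometheus_client instead. Metric names and the port are assumptions, and NVML exposes per-MIG-instance memory but not a per-instance utilization counter.

```python
# Sketch of a MIG-aware Prometheus exporter using pynvml and prometheus_client.
# The project's exporter parsed nvidia-smi instead; metric names and port are assumptions.
import time

import pynvml
from prometheus_client import Gauge, start_http_server

MIG_MEM_USED = Gauge("mig_memory_used_bytes", "Memory used per MIG instance", ["gpu", "mig_uuid"])
MIG_MEM_TOTAL = Gauge("mig_memory_total_bytes", "Memory total per MIG instance", ["gpu", "mig_uuid"])

def collect() -> None:
    for gpu_index in range(pynvml.nvmlDeviceGetCount()):
        gpu = pynvml.nvmlDeviceGetHandleByIndex(gpu_index)
        for mig_index in range(pynvml.nvmlDeviceGetMaxMigDeviceCount(gpu)):
            try:
                mig = pynvml.nvmlDeviceGetMigDeviceHandleByIndex(gpu, mig_index)
            except pynvml.NVMLError:
                continue  # MIG slot not populated
            uuid = pynvml.nvmlDeviceGetUUID(mig)
            uuid = uuid.decode() if isinstance(uuid, bytes) else uuid
            mem = pynvml.nvmlDeviceGetMemoryInfo(mig)
            MIG_MEM_USED.labels(gpu=str(gpu_index), mig_uuid=uuid).set(mem.used)
            MIG_MEM_TOTAL.labels(gpu=str(gpu_index), mig_uuid=uuid).set(mem.total)

if __name__ == "__main__":
    pynvml.nvmlInit()
    start_http_server(9400)  # scraped by Prometheus
    while True:
        collect()
        time.sleep(15)
```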
Frontend & Real-time Features (4-5 weeks)
Developed Next.js frontend with Apollo Client for GraphQL, Socket.io for real-time container status updates, Jupyter notebook management interface, and instructor dashboard for monitoring all student sessions.
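The frontend side of this channel is Next.js with Socket.io as described above; below is a hedged sketch of what the backend half could look like using python-socketio mounted alongside a FastAPI app. The event name and payload shape are assumptions.

```python
# Sketch of the backend half of the real-time channel using python-socketio
# mounted alongside a FastAPI app. Event name and payload shape are assumptions.
import socketio
from fastapi import FastAPI

fastapi_app = FastAPI()
sio = socketio.AsyncServer(async_mode="asgi", cors_allowed_origins="*")
app = socketio.ASGIApp(sio, other_asgi_app=fastapi_app)  # serve both from one ASGI app

async def broadcast_container_status(user_id: str, status: str, mig_uuid: str) -> None:
    """Push container lifecycle updates so students see GPU allocation and startup progress instantly."""
    await sio.emit("container_status", {
        "userId": user_id,
        "status": status,        # e.g. "starting", "running", "stopped"
        "migDevice": mig_uuid,
    })
```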
Testing & Production Deployment (2-3 weeks)
Load testing with 40 concurrent users, stress testing GPU allocation under peak loads, security testing for container isolation, and production deployment with SSL/Nginx reverse proxy setup.
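A rough sketch of how a 40-user concurrency test can be driven with httpx and asyncio; the endpoint path and payload follow the hypothetical API sketch shown earlier, not the project's actual test harness.

```python
# Rough load-test sketch: fire 40 concurrent session-start requests with httpx.
# The endpoint and payload follow the hypothetical API sketch above.
import asyncio

import httpx

async def start_session(client: httpx.AsyncClient, user_id: str) -> int:
    resp = await client.post("http://localhost:8000/sessions", json={"user_id": user_id})
    return resp.status_code

async def main() -> None:
    async with httpx.AsyncClient(timeout=120) as client:
        results = await asyncio.gather(
            *(start_session(client, f"student-{i:02d}") for i in range(40)),
            return_exceptions=True,
        )
    # Failures surfacing here are how allocation race conditions show up under concurrency.
    print(results)

asyncio.run(main())
```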
System Architecture

The platform uses NVIDIA MIG to partition each Tesla V100 into multiple GPU instances (10 instances per GPU = 40 total instances for 40 users). FastAPI service manages Docker containers with NVIDIA runtime, mounting shared datasets (read-only) and private user volumes. NestJS GraphQL API handles authentication, user CRUD, and session management. Custom Prometheus exporters query NVIDIA MIG metrics via nvidia-smi, exposing per-partition GPU utilization and memory usage. Grafana dashboards visualize real-time resource consumption across all student sessions. Next.js frontend uses Apollo Client for GraphQL queries/mutations and Socket.io for real-time container status updates.
Technology Stack
- Backend: NestJS (GraphQL API), FastAPI, MongoDB, Socket.io
- Infrastructure: Docker with NVIDIA runtime, NVIDIA MIG on 4x Tesla V100, Nginx reverse proxy with SSL
- Monitoring: Prometheus with custom GPU exporters, Grafana
- Frontend: Next.js, Apollo Client
Results & Impact
- 40 concurrent students successfully supported, simultaneously running GPU-intensive deep learning workloads
- 4 Tesla V100 GPUs partitioned via NVIDIA MIG to provide isolated GPU instances for each student
- Peak concurrent load tested across all GPU partitions with real medical AI training workloads
- 6 months from architecture design to production deployment with monitoring stack
- Enabled unlimited GPU training for medical AI research (no cloud quota limits)
- Provided secure environment for proprietary medical datasets (on-premise control)
- Replaced need for expensive commercial cloud services (Google Colab Pro, AWS SageMaker)
- Successfully conducted multiple training sessions for healthcare professionals
- Achieved fair resource distribution across 40 users via MIG partitioning
- Real-time monitoring enabled proactive resource management and troubleshooting
What We Learned
- Container orchestration at scale is complex - extensive work needed on lifecycle management, volume mounting, and resource cleanup. Docker SDK requires careful error handling.
- NVIDIA MIG monitoring gaps - most standard monitoring tools don't support MIG, requiring custom Prometheus exporters. This was unexpected and time-consuming.
- Microservices separation paid off - splitting user management (NestJS) from container control (FastAPI) made development and debugging much easier despite initial overhead.
- Real-time feedback is critical - Socket.io updates for container status significantly improved UX, students could see GPU allocation and startup progress instantly.
- Load testing is essential - simulating 40 concurrent users revealed race conditions in GPU allocation that wouldn't appear with small user counts (see the allocation sketch after this list).
- Government projects require flexibility - MongoDB's schema flexibility was valuable during mid-development requirement changes; a rigid Postgres schema would have slowed iteration.
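To illustrate the race-condition lesson above, here is a minimal sketch of serializing MIG assignment behind an asyncio.Lock. The pool contents and function names are assumptions, not the project's allocator.

```python
# Minimal sketch of race-free MIG assignment: a single asyncio.Lock serializes
# allocation so two concurrent session requests cannot claim the same partition.
# Pool contents and names are illustrative assumptions.
import asyncio

free_partitions: list[str] = [f"MIG-uuid-{i}" for i in range(40)]  # placeholder UUIDs
assigned: dict[str, str] = {}
alloc_lock = asyncio.Lock()

async def allocate_partition(user_id: str) -> str:
    async with alloc_lock:                 # without this, concurrent requests raced
        if user_id in assigned:
            return assigned[user_id]       # idempotent: reuse an existing assignment
        if not free_partitions:
            raise RuntimeError("no free GPU partitions")
        mig_uuid = free_partitions.pop()
        assigned[user_id] = mig_uuid
        return mig_uuid

async def release_partition(user_id: str) -> None:
    async with alloc_lock:
        mig_uuid = assigned.pop(user_id, None)
        if mig_uuid:
            free_partitions.append(mig_uuid)
```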




