The Challenge
A government-funded healthcare AI research initiative in Korea needed a platform for training medical professionals on AI/ML using specialized medical datasets. Commercial cloud solutions like Google Colab couldn't provide unlimited GPU access, couldn't host proprietary medical data securely, and lacked the isolated multi-user environment required for structured training programs.
This was part of a national research project focused on advancing healthcare AI capabilities. The platform needed to support hands-on training sessions where 40 students could simultaneously work with GPU-intensive deep learning models on prepared medical datasets, all while maintaining data isolation and security.
Key Constraints
- Support exactly 40 concurrent students (limited by 4x Tesla V100 GPUs)
- Provide isolated workspaces with shared read-only datasets and private user folders
- Enable unlimited GPU usage during training (unlike commercial cloud limits)
- Secure handling of proprietary medical training datasets
- Real-time monitoring of GPU utilization across MIG partitions
- Cost-effective alternative to commercial cloud services
Our Approach
Built a self-hosted Jupyter platform with dual-service architecture: NestJS GraphQL API for user management and session control, and FastAPI service for Docker container lifecycle and GPU allocation. Used NVIDIA MIG (Multi-Instance GPU) to partition 4 Tesla V100 GPUs into isolated instances for fair resource distribution.
Key Technical Decisions
- NestJS GraphQL + FastAPI dual architecture - separation of concerns between user management and container orchestration
- NVIDIA MIG for GPU partitioning - fair resource allocation across 40 concurrent users from 4 physical GPUs
- Docker for environment isolation - each student gets isolated Jupyter container with mounted shared/private volumes
- Custom Prometheus GPU exporters - built custom monitoring solution since standard tools don't support MIG partitioning
- MongoDB for flexibility - rapid iteration on user/session schemas during government project development (see the session-document sketch after this list)
- Socket.io for real-time updates - instant container status and resource usage feedback to students and instructors
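To illustrate the schema-flexibility point above, here is a minimal sketch of how a training session might be stored with pymongo. The collection name, field names, and helper function are hypothetical, not the project's actual schema.

```python
# Hypothetical sketch of a flexible session document in MongoDB (pymongo).
# Collection and field names are illustrative, not the project's actual schema.
from datetime import datetime, timezone

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
sessions = client["training_platform"]["sessions"]

def record_session(user_id: str, container_id: str, mig_uuid: str) -> None:
    """Insert a session document; new fields can be added later without migrations."""
    sessions.insert_one({
        "userId": user_id,
        "containerId": container_id,
        "migDevice": mig_uuid,           # MIG partition assigned to this student
        "status": "starting",
        "createdAt": datetime.now(timezone.utc),
        # Fields such as a dataset version or instructor notes can be added
        # mid-project without a schema migration -- the flexibility noted above.
    })

record_session("student-07", "jupyter-student-07", "MIG-xxxxxxxx")
```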
Timeline: 6 months from initial planning to production deployment with full monitoring stack
Implementation
Architecture Design & Infrastructure Setup (4-6 weeks)
Designed dual-service architecture, set up GPU servers with NVIDIA MIG, configured Docker networking, and established development environment. Planned resource allocation strategy for 40 concurrent users across 4 Tesla V100 GPUs.
Core Platform Development (8-10 weeks)
Built NestJS GraphQL API with role-based authentication, user management, and session control. Developed FastAPI service with Docker SDK for container lifecycle management and GPU allocation via NVIDIA runtime.
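A minimal sketch of what the FastAPI surface for session control could look like. The route paths, request model, and ContainerManager helper are assumptions for illustration, not the project's actual API.

```python
# Minimal FastAPI sketch for session control. Route paths, the request model,
# and the ContainerManager helper are illustrative assumptions.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class SessionRequest(BaseModel):
    user_id: str

class ContainerManager:
    """Hypothetical wrapper around the Docker SDK (see the orchestration sketch below)."""

    def start(self, user_id: str) -> dict:
        # Create the Jupyter container, assign a MIG partition, mount volumes.
        return {"userId": user_id, "status": "starting"}

    def stop(self, user_id: str) -> None:
        # Stop and remove the user's container, release its MIG partition.
        pass

manager = ContainerManager()

@app.post("/sessions")
def create_session(req: SessionRequest) -> dict:
    """Start an isolated Jupyter container for one student."""
    return manager.start(req.user_id)

@app.delete("/sessions/{user_id}")
def delete_session(user_id: str) -> dict:
    """Tear down the student's container and free its GPU partition."""
    manager.stop(user_id)
    return {"status": "stopped"}
```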
Container Orchestration System (6-8 weeks)
Most critical phase - implemented sophisticated container management system with volume mounting (shared read-only medical datasets + private user folders), GPU resource assignment via MIG, and automatic cleanup. Required significant debugging and optimization.
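A condensed sketch of the container-creation step using the Docker SDK for Python. The image name, host paths, port, and memory limit are assumptions; the MIG instance is handed to the NVIDIA container runtime by setting NVIDIA_VISIBLE_DEVICES to the partition's UUID.

```python
# Sketch of per-student container creation with docker-py. Image name, host paths,
# and ports are illustrative; the MIG instance is selected by setting
# NVIDIA_VISIBLE_DEVICES to the partition's UUID for the NVIDIA runtime.
import docker

client = docker.from_env()

def start_student_container(user_id: str, mig_uuid: str, host_port: int):
    return client.containers.run(
        "jupyter/tensorflow-notebook:latest",    # assumed base image
        detach=True,
        name=f"jupyter-{user_id}",
        runtime="nvidia",                         # NVIDIA container runtime
        environment={"NVIDIA_VISIBLE_DEVICES": mig_uuid},
        volumes={
            "/data/medical-datasets": {"bind": "/datasets", "mode": "ro"},           # shared, read-only
            f"/data/users/{user_id}": {"bind": "/home/jovyan/work", "mode": "rw"},   # private workspace
        },
        ports={"8888/tcp": host_port},            # Jupyter exposed per student
        mem_limit="16g",                          # keep one student from exhausting host RAM
    )
```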
Custom Monitoring Solution (3-4 weeks)
Built custom Prometheus exporters for MIG-aware GPU monitoring (most standard tools don't support MIG). Integrated Grafana dashboards for real-time GPU utilization, memory usage, and per-user resource tracking. Essential for managing concurrent workloads.
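The project's exporter parsed nvidia-smi output; as a comparable sketch, the example below uses the NVML Python bindings (pynvml) with prometheus_client instead. Metric names and the port are assumptions, and NVML exposes per-MIG-instance memory but not a per-instance utilization counter.

```python
# Sketch of a MIG-aware Prometheus exporter using pynvml and prometheus_client.
# The project's exporter parsed nvidia-smi instead; metric names and port are assumptions.
import time

import pynvml
from prometheus_client import Gauge, start_http_server

MIG_MEM_USED = Gauge("mig_memory_used_bytes", "Memory used per MIG instance", ["gpu", "mig_uuid"])
MIG_MEM_TOTAL = Gauge("mig_memory_total_bytes", "Memory total per MIG instance", ["gpu", "mig_uuid"])

def collect() -> None:
    for gpu_index in range(pynvml.nvmlDeviceGetCount()):
        gpu = pynvml.nvmlDeviceGetHandleByIndex(gpu_index)
        for mig_index in range(pynvml.nvmlDeviceGetMaxMigDeviceCount(gpu)):
            try:
                mig = pynvml.nvmlDeviceGetMigDeviceHandleByIndex(gpu, mig_index)
            except pynvml.NVMLError:
                continue  # MIG slot not populated
            uuid = pynvml.nvmlDeviceGetUUID(mig)
            uuid = uuid.decode() if isinstance(uuid, bytes) else uuid
            mem = pynvml.nvmlDeviceGetMemoryInfo(mig)
            MIG_MEM_USED.labels(gpu=str(gpu_index), mig_uuid=uuid).set(mem.used)
            MIG_MEM_TOTAL.labels(gpu=str(gpu_index), mig_uuid=uuid).set(mem.total)

if __name__ == "__main__":
    pynvml.nvmlInit()
    start_http_server(9400)  # scraped by Prometheus
    while True:
        collect()
        time.sleep(15)
```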
Frontend & Real-time Features (4-5 weeks)
Developed Next.js frontend with Apollo Client for GraphQL, Socket.io for real-time container status updates, Jupyter notebook management interface, and instructor dashboard for monitoring all student sessions.
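The frontend side of this channel is Next.js with Socket.io as described above; below is a hedged sketch of what the backend half could look like using python-socketio mounted alongside a FastAPI app. The event name and payload shape are assumptions.

```python
# Sketch of the backend half of the real-time channel using python-socketio
# mounted alongside a FastAPI app. Event name and payload shape are assumptions.
import socketio
from fastapi import FastAPI

fastapi_app = FastAPI()
sio = socketio.AsyncServer(async_mode="asgi", cors_allowed_origins="*")
app = socketio.ASGIApp(sio, other_asgi_app=fastapi_app)  # serve both from one ASGI app

async def broadcast_container_status(user_id: str, status: str, mig_uuid: str) -> None:
    """Push container lifecycle updates so students see GPU allocation and startup progress instantly."""
    await sio.emit("container_status", {
        "userId": user_id,
        "status": status,        # e.g. "starting", "running", "stopped"
        "migDevice": mig_uuid,
    })
```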
Testing & Production Deployment (2-3 weeks)
Load testing with 40 concurrent users, stress testing GPU allocation under peak loads, security testing for container isolation, and production deployment with SSL/Nginx reverse proxy setup.
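A rough sketch of how a 40-user concurrency test can be driven with httpx and asyncio; the endpoint path and payload follow the hypothetical API sketch shown earlier, not the project's actual test harness.

```python
# Rough load-test sketch: fire 40 concurrent session-start requests with httpx.
# The endpoint and payload follow the hypothetical API sketch above.
import asyncio

import httpx

async def start_session(client: httpx.AsyncClient, user_id: str) -> int:
    resp = await client.post("http://localhost:8000/sessions", json={"user_id": user_id})
    return resp.status_code

async def main() -> None:
    async with httpx.AsyncClient(timeout=120) as client:
        results = await asyncio.gather(
            *(start_session(client, f"student-{i:02d}") for i in range(40)),
            return_exceptions=True,
        )
    # Failures surfacing here are how allocation race conditions show up under concurrency.
    print(results)

asyncio.run(main())
```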
System Architecture

The platform uses NVIDIA MIG to partition each Tesla V100 into multiple GPU instances (10 instances per GPU = 40 total instances for 40 users). FastAPI service manages Docker containers with NVIDIA runtime, mounting shared datasets (read-only) and private user volumes. NestJS GraphQL API handles authentication, user CRUD, and session management. Custom Prometheus exporters query NVIDIA MIG metrics via nvidia-smi, exposing per-partition GPU utilization and memory usage. Grafana dashboards visualize real-time resource consumption across all student sessions. Next.js frontend uses Apollo Client for GraphQL queries/mutations and Socket.io for real-time container status updates.
Technology Stack
- Backend: NestJS (GraphQL API), FastAPI, MongoDB, Socket.io
- Infrastructure: Docker with NVIDIA runtime, NVIDIA MIG on 4x Tesla V100, Nginx reverse proxy with SSL
- Monitoring: Prometheus with custom GPU exporters, Grafana
- Frontend: Next.js, Apollo Client
Results & Impact
- 40 concurrent students successfully supported, simultaneously running GPU-intensive deep learning workloads
- 4 Tesla V100 GPUs partitioned via NVIDIA MIG to provide isolated GPU instances for each student
- Peak concurrent load tested across all GPU partitions with real medical AI training workloads
- 6 months from architecture design to production deployment with monitoring stack
- Enabled unlimited GPU training for medical AI research (no cloud quota limits)
- Provided secure environment for proprietary medical datasets (on-premise control)
- Replaced need for expensive commercial cloud services (Google Colab Pro, AWS SageMaker)
- Successfully conducted multiple training sessions for healthcare professionals
- Achieved fair resource distribution across 40 users via MIG partitioning
- Real-time monitoring enabled proactive resource management and troubleshooting
What We Learned
- Container orchestration at scale is complex - extensive work needed on lifecycle management, volume mounting, and resource cleanup. Docker SDK requires careful error handling.
- NVIDIA MIG monitoring gaps - most standard monitoring tools don't support MIG, requiring custom Prometheus exporters. This was unexpected and time-consuming.
- Microservices separation paid off - splitting user management (NestJS) from container control (FastAPI) made development and debugging much easier despite initial overhead.
- Real-time feedback is critical - Socket.io updates for container status significantly improved UX, students could see GPU allocation and startup progress instantly.
- Load testing is essential - simulating 40 concurrent users revealed race conditions in GPU allocation that wouldn't appear with small user counts (see the allocation sketch after this list).
- Government projects require flexibility - MongoDB's schema flexibility was valuable during mid-development requirement changes; a rigid Postgres schema would have slowed iteration.
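To illustrate the race-condition lesson above, here is a minimal sketch of serializing MIG assignment behind an asyncio.Lock. The pool contents and function names are assumptions, not the project's allocator.

```python
# Minimal sketch of race-free MIG assignment: a single asyncio.Lock serializes
# allocation so two concurrent session requests cannot claim the same partition.
# Pool contents and names are illustrative assumptions.
import asyncio

free_partitions: list[str] = [f"MIG-uuid-{i}" for i in range(40)]  # placeholder UUIDs
assigned: dict[str, str] = {}
alloc_lock = asyncio.Lock()

async def allocate_partition(user_id: str) -> str:
    async with alloc_lock:                 # without this, concurrent requests raced
        if user_id in assigned:
            return assigned[user_id]       # idempotent: reuse an existing assignment
        if not free_partitions:
            raise RuntimeError("no free GPU partitions")
        mig_uuid = free_partitions.pop()
        assigned[user_id] = mig_uuid
        return mig_uuid

async def release_partition(user_id: str) -> None:
    async with alloc_lock:
        mig_uuid = assigned.pop(user_id, None)
        if mig_uuid:
            free_partitions.append(mig_uuid)
```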




