---
# Deployment running the vLLM OpenAI-compatible API server (serving gpt2)
# on 2 NVIDIA GPUs with tensor parallelism.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server
  namespace: vllm-ns
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-inference-server
  template:
    metadata:
      labels:
        app: vllm-inference-server
    spec:
      # Requires the NVIDIA container runtime class to expose GPUs.
      runtimeClassName: nvidia
      containers:
        - name: vllm-inference-server
          # NOTE(review): ':latest' combined with IfNotPresent can keep serving
          # a stale cached image after upstream updates — pin a version tag
          # for reproducible rollouts.
          image: vllm/vllm-openai:latest
          imagePullPolicy: IfNotPresent
          resources:
            limits:
              # Two GPUs to match --tensor-parallel-size=2 below.
              nvidia.com/gpu: 2
          env:
            # NOTE(review): inject the token from a Secret (secretKeyRef)
            # rather than committing a value here; empty works only for
            # public, ungated models such as gpt2.
            - name: HUGGING_FACE_HUB_TOKEN
              value: ""
            # Model/tokenizer cache; backed by the 'cache' emptyDir mounted
            # at /.cache below.
            - name: TRANSFORMERS_CACHE
              value: /.cache
            # NOTE(review): 'shm-size' is a docker-run flag, not an env var
            # vLLM reads — presumably inert. Shared memory is actually
            # provided by the Memory-backed /dev/shm emptyDir below; confirm
            # and drop this entry.
            - name: shm-size
              value: 1g
          # Debug entrypoint kept for reference (idle loop for exec/inspect):
          # command: ["/bin/bash", "-c"]
          # args:
          #   - while true; do sleep 2600; done
          command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
          args:
            - "--model=openai-community/gpt2"
            - "--gpu-memory-utilization=0.95"
            - "--disable-log-requests"
            - "--trust-remote-code"
            - "--port=8000"
            # half precision — required on GPUs without bfloat16 support
            - "--dtype=half"
            # Shard the model across both GPUs requested in resources.limits.
            - "--tensor-parallel-size=2"
          ports:
            - containerPort: 8000
              name: http
          securityContext:
            # Non-root UID; pairs with the world-writable emptyDir caches.
            runAsUser: 1000
          volumeMounts:
            # Memory-backed shared memory for NCCL / torch multiprocessing.
            - mountPath: /dev/shm
              name: dshm
            # Hugging Face download cache (see TRANSFORMERS_CACHE above).
            - mountPath: /.cache
              name: cache
      volumes:
        - name: cache
          emptyDir: {}
        - name: dshm
          emptyDir:
            medium: Memory