How to Automatically Manage an Internal Dev Environment with Cloudflare + AI (Working Code Included)

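Running an internal development environment means spending a lot of time managing access paths and responding to incidents. With Cloudflare Tunnel, you can expose internal services to the outside securely without opening any inbound ports on the server. Add AI-based incident analysis on top, and you go beyond simple monitoring to automatically generating “root-cause analysis + remediation guidance.”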

1. Cloudflare Tunnel setup

Example environment

code-server: localhost:8080
grafana: localhost:3000

 # /etc/cloudflared/config.yml
 tunnel: dev-tools-tunnel
 credentials-file: /etc/cloudflared/dev-tools-tunnel.json
 
 ingress:
   - hostname: code.example.com
     service: http://localhost:8080
 
   - hostname: grafana.example.com
     service: http://localhost:3000
 
   # Catch-all rule for unmatched requests; it must come last
   - service: http_status:404
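
Ingress rules are matched from top to bottom, and the final rule has to be a catch-all with no hostname. The sketch below checks that ordering on plain dicts that mirror the YAML above. `validate_ingress` is a hypothetical helper written for this post, not part of cloudflared; the authoritative check remains `cloudflared tunnel ingress validate`.

```python
def validate_ingress(rules):
    """Return a list of problems found in a list of ingress rule dicts."""
    if not rules:
        return ["ingress list is empty"]
    problems = []
    for i, rule in enumerate(rules):
        if "service" not in rule:
            problems.append(f"rule {i}: missing 'service'")
    # A rule without hostname/path matches everything, so it may only be last
    for i, rule in enumerate(rules[:-1]):
        if "hostname" not in rule and "path" not in rule:
            problems.append(f"rule {i}: catch-all rule must be last")
    last = rules[-1]
    if "hostname" in last or "path" in last:
        problems.append("last rule must be a catch-all (no hostname or path)")
    return problems


# Same rules as the config.yml above, expressed as Python dicts
rules = [
    {"hostname": "code.example.com", "service": "http://localhost:8080"},
    {"hostname": "grafana.example.com", "service": "http://localhost:3000"},
    {"service": "http_status:404"},
]
print(validate_ingress(rules))
```

An empty list means the rule ordering looks sane; dropping the final `http_status:404` entry would produce a "last rule must be a catch-all" message instead.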

Registering the service

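 # Assumes the cloudflared systemd unit already exists
 # (for example, created with `sudo cloudflared service install`)
 sudo systemctl daemon-reload
 sudo systemctl enable --now cloudflared
 sudo systemctl status cloudflared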

2. Health checks against external URLs

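Health checks are only meaningful when they target the URLs users actually access, because that path exercises DNS, the Cloudflare edge, the tunnel, and the origin service together.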

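 import requests
 from datetime import datetime, timezone
 
 # Public URLs that users reach through the tunnel
 URLS = [
     "https://code.example.com/healthz",
     "https://grafana.example.com/api/health",
 ]
 
 def healthcheck():
     """Probe each URL and return one result dict per URL."""
     results = []
     for url in URLS:
         # Timezone-aware UTC timestamp (datetime.utcnow() is deprecated)
         checked_at = datetime.now(timezone.utc).isoformat()
         try:
             r = requests.get(url, timeout=5)
             results.append({
                 "url": url,
                 "ok": 200 <= r.status_code < 300,
                 "status": r.status_code,
                 "time": checked_at,
             })
         except requests.RequestException as e:
             # Timeouts, DNS failures, and connection errors count as failures
             results.append({
                 "url": url,
                 "ok": False,
                 "status": None,
                 "error": str(e),
                 "time": checked_at,
             })
     return results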

3. Collecting cloudflared logs

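 import subprocess
 
 def tail_cloudflared_logs(lines=200):
     """Return the most recent journal entries for the cloudflared unit."""
     cmd = ["journalctl", "-u", "cloudflared", "-n", str(lines), "--no-pager"]
     r = subprocess.run(cmd, capture_output=True, text=True)
     # Cap the payload at the last 20,000 characters
     return r.stdout[-20000:]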
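
Two hundred journal lines are mostly routine entries, so it may help to keep only warnings and errors before handing the text to a model. The sketch below is one way to do that. It assumes the log lines carry short level tags such as `ERR` and `WRN`; check your own journal output before relying on it. `filter_log_lines` is a hypothetical helper, and the sample lines are made up for illustration.

```python
def filter_log_lines(log_text, levels=("ERR", "WRN"), max_lines=50):
    """Keep the most recent lines that contain one of the given level tags."""
    matched = [
        line
        for line in log_text.splitlines()
        if any(f" {level} " in line for level in levels)
    ]
    return "\n".join(matched[-max_lines:])


# Made-up sample lines for illustration only, not real cloudflared output
sample = "\n".join([
    "2024-01-01T00:00:01Z INF sample info message",
    "2024-01-01T00:00:02Z ERR sample error message",
    "2024-01-01T00:00:03Z WRN sample warning message",
])
print(filter_log_lines(sample))
```

If the filter returns an empty string, falling back to the unfiltered tail keeps the model from analyzing nothing.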

4. OpenAI-based incident analysis

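 from openai import OpenAI
 
 # Reads the API key from the OPENAI_API_KEY environment variable
 client = OpenAI()
 
 def ai_triage(health_results, log_tail):
     """Ask the model for a structured incident report."""
     prompt = f"""
     You are an SRE.
     Based on the information below, write up the likely cause of the incident and a remediation guide.
 
     Output format:
     1. Situation summary
     2. Top 3 suspected causes
     3. Verification steps
     4. Immediate actions
     5. Recurrence prevention
 
     Health check results:
     {health_results}
 
     Logs:
     {log_tail}
     """
 
     resp = client.chat.completions.create(
         model="gpt-4o-mini",
         messages=[{"role": "user", "content": prompt}],
         temperature=0.2,
     )
 
     return resp.choices[0].message.content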
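
The prompt above interpolates the Python list directly, so the model sees a Python repr rather than structured data. A possible alternative, sketched below, serializes only the failed checks as JSON and trims the log tail to a fixed budget. `build_prompt` and `max_log_chars` are hypothetical names chosen for this example, and the 8,000-character default is an arbitrary assumption.

```python
import json


def build_prompt(health_results, log_tail, max_log_chars=8000):
    """Build the data section of the prompt from failed checks and recent logs."""
    failed = [r for r in health_results if not r.get("ok")]
    return "\n".join([
        "Failed health checks (JSON):",
        json.dumps(failed, ensure_ascii=False, indent=2),
        "",
        "Recent cloudflared logs:",
        log_tail[-max_log_chars:],  # keep only the newest part of the log
    ])


# Example input shaped like the output of healthcheck()
example_results = [
    {"url": "https://code.example.com/healthz", "ok": True, "status": 200},
    {"url": "https://grafana.example.com/api/health", "ok": False, "status": 502},
]
example_prompt = build_prompt(example_results, "older lines... newest line")
print(example_prompt)
```

Sending only the failures keeps the prompt short and focuses the analysis on what is actually broken.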

5. Sending Slack alerts

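 import os
 import requests
 
 # Incoming webhook URL, supplied through the environment
 SLACK_WEBHOOK = os.getenv("SLACK_WEBHOOK")
 
 def send_slack(message):
     """Post the message to Slack; do nothing if no webhook is configured."""
     if not SLACK_WEBHOOK:
         return
     payload = {"text": message}
     # A timeout keeps a slow Slack endpoint from hanging the script
     requests.post(SLACK_WEBHOOK, json=payload, timeout=5)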

6. End-to-end flow

 def main():
     results = healthcheck()
     failed = [r for r in results if not r.get("ok")]
 
     # Nothing to report when every check passes
     if not failed:
         return
 
     logs = tail_cloudflared_logs()
     report = ai_triage(results, logs)
     send_slack(report)
 
 if __name__ == "__main__":
     main()
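
As written, the flow sends a report on every run for as long as a service stays down. One option, sketched below, is to alert only when the set of failing URLs changes between runs. The state file and the `should_alert` helper are assumptions added for this example and are not part of the script above.

```python
import json
import tempfile
from pathlib import Path


def failing_urls(results):
    """Sorted list of URLs whose health check failed."""
    return sorted(r["url"] for r in results if not r.get("ok"))


def should_alert(results, state_path):
    """Return True only when failures exist and differ from the previous run."""
    path = Path(state_path)
    current = failing_urls(results)
    try:
        previous = json.loads(path.read_text())
    except (FileNotFoundError, json.JSONDecodeError):
        previous = []  # first run or unreadable state
    path.write_text(json.dumps(current))
    return bool(current) and current != previous


# Demo with a temporary state file
with tempfile.TemporaryDirectory() as tmp:
    state = Path(tmp) / "state.json"
    down = [{"url": "https://grafana.example.com/api/health", "ok": False}]
    up = [{"url": "https://grafana.example.com/api/health", "ok": True}]
    first = should_alert(down, state)       # new failure
    second = should_alert(down, state)      # same failure again
    recovered = should_alert(up, state)     # back to healthy
    again = should_alert(down, state)       # fails again after recovery
    print(first, second, recovered, again)
```

In `main()`, this check would sit between the `failed` filter and the call to `ai_triage`, which also avoids paying for a model call on every repeated failure.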

7. Registering with cron

 */10 * * * * /usr/bin/python3 /opt/devops/devtools_monitor.py
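
Cron runs jobs with a minimal environment, so variables exported in an interactive shell are usually not visible to the script. The script relies on `SLACK_WEBHOOK`, and the OpenAI client reads `OPENAI_API_KEY` by default. A small startup check like the sketch below can make a missing variable obvious instead of failing silently. `missing_env` and `REQUIRED_ENV` are hypothetical names for this example.

```python
import os

# Variables the monitor script depends on
REQUIRED_ENV = ("OPENAI_API_KEY", "SLACK_WEBHOOK")


def missing_env(required=REQUIRED_ENV, env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


# Example with an explicit mapping instead of the real environment
print(missing_env(env={"OPENAI_API_KEY": "example-key"}))
```

Calling `missing_env()` at the top of `main()` and exiting with a non-zero status when the list is non-empty is one way to surface the problem in cron's mail or logs.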

Practical benefits

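Internal services are exposed safely without opening ports
Likely causes are summarized automatically when an incident occurs
Operational response time is shortened
Recurring failure patterns can be analyzed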
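
Pattern analysis needs a history of results, which the script above does not store. Assuming each run's result dicts are appended to some log or list (an assumption of this sketch), counting failures per URL is a simple starting point. `failure_counts` is a hypothetical helper.

```python
from collections import Counter


def failure_counts(history):
    """Count failed checks per URL across stored healthcheck() results."""
    return Counter(r["url"] for r in history if not r.get("ok"))


# Example history shaped like accumulated healthcheck() output
history = [
    {"url": "https://grafana.example.com/api/health", "ok": False},
    {"url": "https://grafana.example.com/api/health", "ok": False},
    {"url": "https://code.example.com/healthz", "ok": False},
    {"url": "https://code.example.com/healthz", "ok": True},
]
counts = failure_counts(history)
print(counts.most_common())
```

The URL at the top of `most_common()` is the one to investigate first for a recurring cause.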

This setup is not just monitoring; it is automation that standardizes how the team responds to incidents.