Cloud Platform Engineering · AI Services · Production Systems

Tal Pal Attia

Rochester, Minnesota

Lowering the barrier to innovation through practical cloud and AI infrastructure.

I am a software engineer at Mayo Clinic IT focused on cloud-native platforms and AI services. I design and operate secure backend services, agentic AI tooling, and GPU-ready compute environments that help teams build and deploy AI workloads reliably.

My work spans Terraform-based automation on GCP, secure backend services for AI access, SLURM-based GPU clusters, and observability for ML systems. I enjoy building platform capabilities that improve developer velocity while raising the reliability and safety bar by default.


Cloud Platform & AI Services

Software Engineer, Generative AI Program · Mayo Clinic IT

M.Sc. Biomedical Engineering and Physiology · Mayo Clinic Graduate School

B.Sc. Industrial Engineering and Management · Ben Gurion University

Platform Impact

The production capabilities I’ve designed, deployed, and operated.
This section focuses on outcomes and responsibilities rather than tools.

Impact 01

Agentic AI & MCP Services

Designed and built core components of a Patient MCP Server — a FastAPI-based Model Context Protocol service that exposes structured clinical data to LLM-driven agents, with tool routing across multiple backends and multi-LLM evaluation across Claude, Gemini, and OpenAI.
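The core idea behind tool routing in a service like this can be sketched in plain Python. This is an illustrative registry-and-dispatch pattern, not the actual Patient MCP Server implementation; the tool name, backend labels, and handler are hypothetical.

```python
# Minimal sketch of MCP-style tool routing across backends
# (illustrative names only, not the production service).
from dataclasses import dataclass
from typing import Any, Callable, Dict

@dataclass
class Tool:
    name: str
    backend: str  # e.g. "fhir" or "bigquery" (hypothetical labels)
    handler: Callable[[Dict[str, Any]], Dict[str, Any]]

class ToolRouter:
    """Registers tools and dispatches agent calls to the right backend."""

    def __init__(self) -> None:
        self._tools: Dict[str, Tool] = {}

    def register(self, tool: Tool) -> None:
        self._tools[tool.name] = tool

    def call(self, name: str, args: Dict[str, Any]) -> Dict[str, Any]:
        if name not in self._tools:
            raise KeyError(f"unknown tool: {name}")
        tool = self._tools[name]
        return {"backend": tool.backend, "result": tool.handler(args)}

router = ToolRouter()
router.register(Tool("lookup_loinc", "fhir",
                     lambda a: {"code": a["code"], "system": "LOINC"}))
print(router.call("lookup_loinc", {"code": "718-7"}))
```

Keeping routing in one place like this is what lets a single MCP endpoint fan out to multiple data backends while the agent only sees a flat tool list.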

Impact 02

Secure AI Service Delivery

Designed and operated internal services for model and provider access with strong access controls, secret management, audit logging, and automated credential rotation — enabling self-service use without losing governance.
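The rotation half of this pattern reduces to an age-based policy check. The sketch below is a hypothetical helper with an assumed 90-day window; the real service's schema and rotation machinery are not shown.

```python
# Illustrative age-based credential rotation policy
# (hypothetical helper; assumed 90-day window, not the production logic).
from datetime import datetime, timedelta, timezone

MAX_KEY_AGE = timedelta(days=90)  # assumed rotation window

def keys_due_for_rotation(keys, now=None):
    """Return the IDs of keys older than the allowed rotation window."""
    now = now or datetime.now(timezone.utc)
    return [k["id"] for k in keys if now - k["created"] > MAX_KEY_AGE]

now = datetime(2025, 1, 1, tzinfo=timezone.utc)
keys = [
    {"id": "svc-a", "created": now - timedelta(days=120)},  # overdue
    {"id": "svc-b", "created": now - timedelta(days=10)},   # fresh
]
print(keys_due_for_rotation(keys, now))  # → ['svc-a']
```

Running a check like this on a schedule, then minting and storing replacements in a secret manager, is what makes rotation automatic rather than a manual audit chore.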

Impact 03

API Gateway Modernization

Migrated the Azure OpenAI Services proxy from Apigee Edge to Apigee X, improving security posture, scalability, and operational monitoring for downstream AI workloads.

Impact 04

Infrastructure as Code That Standardizes Delivery

Developed reusable Terraform and Ansible modules and CI/CD patterns for internal platform programs that standardize application and AI service delivery, making platform work repeatable, reviewable, and easier to operate over time.
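The value of such modules is that consuming teams declare intent rather than infrastructure. A hypothetical invocation might look like the fragment below; the module path, variables, and service names are illustrative, not the actual internal module interface.

```hcl
# Hypothetical usage of a reusable platform module; names and
# variables are illustrative, not the real internal interface.
module "ai_service" {
  source               = "./modules/cloud-run-service"  # assumed module path
  project_id           = var.project_id
  region               = "us-central1"
  service_name         = "example-ai-service"
  enable_audit_logging = true  # secure defaults baked into the module
}
```

Because the module owns the secure defaults, every service deployed through it is reviewable against one known-good pattern.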

Impact 05

GPU / HPC Platforms Teams Can Use

Built self-service GPU and HPC environments using SLURM and GCP Managed Instance Groups, including MIG-based partitioning and distributed deployments via Google Cluster Toolkit to support scalable training and data processing workloads.
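Part of automating cluster lifecycle is rendering scheduler configuration from the provisioned node inventory. The helper below is an illustrative sketch of generating a slurm.conf partition stanza for MIG-backed nodes; the function, node names, and GRES string are hypothetical, not the actual provisioning code.

```python
# Illustrative helper that renders a slurm.conf partition stanza from a
# list of GPU nodes (hypothetical; not the actual provisioning tooling).
def render_partition(name, nodes, gres="gpu:1", default=False):
    """Build NodeName/PartitionName lines for a GPU partition."""
    node_list = ",".join(nodes)
    lines = [f"NodeName={node_list} Gres={gres}"]
    lines.append(
        f"PartitionName={name} Nodes={node_list} "
        f"Default={'YES' if default else 'NO'} State=UP"
    )
    return "\n".join(lines)

# Example: an A100 partition sliced into 1g.5gb MIG instances (assumed GRES).
print(render_partition("a100-mig", ["gpu-node-[1-4]"], gres="gpu:1g.5gb:7"))
```

Generating config from inventory this way keeps the scheduler's view of the fleet in lockstep with what the instance groups actually provisioned.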

Impact 06

Observability for Reliability and Capacity

Integrated GPU telemetry and health signals (NVIDIA DCGM + Cloud Operations Suite) to support SLO-driven operations, faster debugging, and better capacity planning for AI and ML workloads.
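The capacity-planning end of this pipeline is a rollup from per-GPU samples to fleet-level signals. The toy aggregation below only illustrates that rollup; in practice the samples come from DCGM metrics surfaced through Cloud Operations, and the threshold here is an assumed value.

```python
# Toy rollup of per-GPU utilization samples into fleet-level signals
# (illustrative; real samples come from DCGM via Cloud Operations).
def fleet_summary(samples, saturation_threshold=0.9):
    """samples: {gpu_id: [utilization fractions]} -> summary dict."""
    means = {g: sum(v) / len(v) for g, v in samples.items() if v}
    saturated = [g for g, m in means.items() if m >= saturation_threshold]
    fleet_mean = sum(means.values()) / len(means) if means else 0.0
    return {"fleet_mean": round(fleet_mean, 3), "saturated": saturated}

print(fleet_summary({"gpu0": [0.95, 0.97], "gpu1": [0.2, 0.4]}))
# → {'fleet_mean': 0.63, 'saturated': ['gpu0']}
```

Separating "which devices are saturated" from "how hot is the fleet on average" is what lets the same telemetry drive both debugging and capacity decisions.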

Impact 07

Responsible AI in Regulated Environments

Built audit logging, reproducible pipelines, and HIPAA/PHI-aware workflows into AI services from day one — partnering with research, clinical, finance, and cloud operations teams to ensure designs meet security, compliance, and real-world usage requirements.
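A common pattern behind tamper-evident audit trails is hash-chaining each record to its predecessor. The sketch below shows that general pattern only; the field names are hypothetical and this is not the production schema.

```python
# Minimal sketch of hash-chained audit records, a common pattern for
# tamper-evident logs (illustrative only; not the production schema).
import hashlib
import json
from datetime import datetime, timezone

def audit_record(actor, action, resource, prev_hash=""):
    entry = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "actor": actor,
        "action": action,
        "resource": resource,  # identifiers only; never raw PHI
        "prev": prev_hash,     # links this record to the previous one
    }
    payload = json.dumps(entry, sort_keys=True).encode()
    entry["hash"] = hashlib.sha256(payload).hexdigest()
    return entry

r1 = audit_record("svc-agent", "read", "patient/123")
r2 = audit_record("svc-agent", "read", "patient/456", prev_hash=r1["hash"])
```

Because each hash covers the previous record's hash, silently editing or deleting an earlier entry breaks the chain, which is exactly the property reviewers look for in regulated environments.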

I lower the barrier to innovation by making “the right way” the easy way.

Strong defaults, automation, and observability reduce friction without sacrificing reliability or trust. When platforms provide clear guardrails and visibility, engineers spend less time fighting infrastructure and more time shipping.

That approach scales — in big-tech environments and regulated domains — because it is fundamentally about operational clarity.

Skills

The core technologies and systems I work with across cloud platforms, AI services, distributed compute, and ML infrastructure.

Cloud & Platform Engineering

  • Cloud architecture and operations on Google Cloud Platform (GCP)
  • Google Cloud Digital Leader (certified)
  • Associate Cloud Engineer (certified)
  • Infrastructure as Code and automation (Terraform, Ansible)
  • API management and traffic control (Apigee X, Cloud Run)
  • Compute orchestration with managed instance groups (MIGs) & autoscaling
  • Linux systems engineering, networking & scripting (Bash)
  • CI/CD pipelines and reusable infrastructure modules

Generative AI & Agents

  • Model Context Protocol (MCP) server design & agent tooling
  • Multi-LLM evaluation across Claude, Gemini, and OpenAI
  • Agent Development Kit (ADK) agents and prompt engineering
  • FHIR/HL7 clinical data integration with LLM agents
  • HIPAA/PHI-aware service design and audit logging

Software Engineering

  • Python as a primary language (tooling, services, ML)
  • API design & service development (FastAPI, REST, OpenAPI/Swagger)
  • Microservice architecture and backend systems
  • Internal tools & admin UIs (HTML, CSS, React)
  • Data storage & querying (BigQuery, MySQL)

Data & Machine Learning

  • Deep learning & classical ML (TensorFlow, PyTorch, scikit-learn)
  • Time-series modeling, forecasting, and anomaly detection
  • Multimodal biomedical and operational datasets
  • Metadata schema design and data quality monitoring
  • Signal processing techniques (PCA, ICA, filtering)

ML Compute & Observability

  • GPU platform management (MIGs, scheduling, monitoring)
  • SLURM cluster provisioning & lifecycle automation
  • GPU observability & performance monitoring (NVIDIA DCGM, Cloud Ops)
  • Distributed training & large-scale execution (Ray, Cluster Toolkit)
  • Containerization & environment packaging (Docker)
  • SLO-driven reliability and cost visibility for compute

Responsible AI & Compliance

  • HIPAA/PHI workflows and clinical data governance
  • Reproducible ML pipelines and audit logging
  • FAIR principles for data sharing and reuse
  • Secret management and credential rotation patterns
  • Cross-functional alignment across research, clinical, and ops teams

Experience

Software Engineer (Cloud Platform & AI Services)

Mayo Clinic · Generative Artificial Intelligence Program

Remote · 2024 to Present

I design and build cloud-native platform services on GCP for internal programs that standardize application and AI service delivery, developing reusable Terraform and Ansible modules and CI/CD patterns. I build secure backend services and admin UIs for managing model access and credentials, and I migrated the Azure OpenAI Services proxy from Apigee Edge to Apigee X. I also develop GPU/HPC self-service capabilities using SLURM and Managed Instance Groups, paired with a GPU observability stack built on Cloud Operations and NVIDIA DCGM. In addition, I contributed core components to a Patient MCP Server: a FastAPI-based Model Context Protocol service exposing clinical data (FHIR, BigQuery) to LLM-driven agents, with tool routing, LOINC/ICD code integration, and multi-LLM evaluation across Claude, Gemini, and OpenAI.

Research Engineer (Data Science)

Mayo Clinic · Multimodal Neuroimaging Laboratory

On-site · 2021 to 2024

I built high-throughput pipelines for diffusion MRI and intracranial EEG processing using Python, Linux, and distributed workflows. I led development of HED-SCORE, an open-source EEG metadata framework adopted by international research teams, and designed tools for multimodal data integration, structured metadata management, and quality control across large neuroimaging datasets in collaboration with clinicians and data scientists.

Research Engineer (Applied ML)

Mayo Clinic · Bioelectronics Neurophysiology and Engineering Laboratory

On-site · 2018 to 2021

I worked on ML for long-duration EEG and wearable sensor data, developing LSTM-based seizure prediction models, integrating real-time ML components into clinical-grade monitoring systems, and building ingestion, preprocessing, and feature extraction pipelines for large biosignal datasets in collaboration with neurology and engineering teams.

Research Technologist

Mayo Clinic · Bioelectronics Neurophysiology and Engineering Laboratory

On-site · 2016 to 2018

I prototyped closed-loop neurostimulation systems that combined hardware signals with real-time software control, and engineered backend tools for structured patient data tracking, automation, and biosignal analysis while supporting research teams with software development, data workflows, and database design for experimental studies.

Projects and Publications

Generative AI & Platform Work

Patient MCP Server

FastAPI · MCP · FHIR · BigQuery

Model Context Protocol service exposing clinical data to LLM-driven agents. Includes tool routing across backends, LOINC and ICD code integration, and multi-LLM evaluation across Claude, Gemini, and OpenAI.

OpenAI Service Account & Key Manager

FastAPI · React · GCP Secret Manager

Internal service for managing OpenAI service accounts and API keys. Integrates with GCP Secret Manager, includes audit logging and scheduled rotation, and provides a simple admin UI for platform and security teams.

Apigee X Migration for AI Proxy

Apigee X · Azure OpenAI

Migration of the Azure OpenAI Services proxy from Apigee Edge to Apigee X, strengthening security posture, scalability, and operational monitoring for downstream AI workloads.

GPU Observability Pipeline

Cloud Ops · NVIDIA DCGM

Unified GPU telemetry stack that collects metrics and health signals from GPU nodes and exposes them in dashboards and alerts. Enables teams to track utilization, debug issues, and plan capacity for AI and ML workloads.

Self-Service SLURM Cluster Provisioning

SLURM · GCP Managed Instance Groups · Cluster Toolkit

Automated system for creating and managing SLURM clusters on demand for research and production teams, using instance templates, Google Cluster Toolkit, and infrastructure automation to standardize configuration and simplify lifecycle operations.

Platform Modules for Application & AI Delivery

Terraform · Ansible · CI/CD

Reusable infrastructure modules and CI/CD pipelines supporting consistent deployment of cloud-native applications and AI-enabled services across internal platform programs.

Research, Publications, and Open Source

HED-SCORE — EEG Annotation Framework

Multimodal Neuroimaging Lab

Machine-readable metadata framework for EEG annotations that supports FAIR-aligned data sharing and large-scale analysis. Designed to make neurophysiology datasets easier to reuse across labs and tools.

View on Google Scholar

Multimodal Neuroimaging Pipelines

Multimodal Neuroimaging Lab

Python-based pipelines for diffusion MRI and intracranial EEG preprocessing and analysis. Built to support reproducible studies and collaboration between engineering and clinical research teams.

Algorithms for Time Series and Signals

Bioelectronics Neurophysiology and Engineering Lab

Models and algorithms for long-duration EEG and wearable sensor data, including LSTM-based seizure prediction, anomaly detection, forecasting of biomedical time series, and signal artifact reduction.

View on Google Scholar

Patents

Seizure Forecasting in Wearable Device Data Using ML

US20220359071A1 (Pending)

System for seizure risk forecasting using EEG and wearable data. Combines feature engineering, temporal modeling, and long-horizon prediction for real-world monitoring scenarios.

View on Google Scholar

Let’s Build Reliable AI Platforms

What I Bring

  • Production cloud-native platforms and AI services on GCP.
  • AI systems including MCP servers, agent tooling, and multi-LLM evaluation.
  • GPU-ready environments and SLURM-based clusters for distributed workloads.
  • Automation patterns (Terraform/Ansible/CI/CD) that standardize delivery across teams.
  • Observability and reliability for AI and ML systems (DCGM + cloud monitoring).
  • Security and governance practices suitable for HIPAA/PHI environments.

Where I Fit Best

  • Cloud Platform / Backend Engineering for AI workloads.
  • AI services engineering — MCP, agent design, tool integration, and evaluation.
  • ML Infrastructure / ML Platform Engineering.
  • Distributed compute & reliability roles (GPU/HPC, orchestration, observability).
  • Teams operating AI platforms in regulated or high-stakes domains.

Areas of Interest

Generative AI · MCP & Agents · Cloud Platforms · ML Infrastructure · GPU and HPC · Observability · Security & Governance · Time Series ML · Biomedical Data

If you’re building cloud platforms, AI services, or secure AI infrastructure — I’d love to connect.

Email Me