ML Infrastructure · Platform Engineering · Production AI Systems

Tal Pal Attia

Rochester, Minnesota

Lowering the barrier to innovation through practical ML infrastructure.

I am a software engineer at Mayo Clinic IT focused on machine learning infrastructure and internal platforms. I build GPU-ready compute, distributed execution environments, and cloud-native services that help teams train and deploy ML workloads reliably.

My work spans Terraform-based automation on GCP, SLURM-based GPU clusters, observability for ML systems, and secure API services designed for governance and auditability. I enjoy building platform capabilities that improve developer velocity while raising the reliability and safety bar by default.


ML Infrastructure and Platform Engineering

Software Engineer, Generative AI Program · Mayo Clinic IT

M.Sc. Biomedical Engineering and Physiology · Mayo Clinic Graduate School

B.Sc. Industrial Engineering and Management · Ben Gurion University

Platform Impact

The production capabilities I’ve designed, deployed, and operated.
This section focuses on outcomes and responsibilities rather than tools.

Impact 01

Secure AI Service Delivery

Designed and operated internal services for model/provider access with strong access controls, secret management, audit logging, and automated credential rotation — enabling self-service use without losing governance.
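As a minimal sketch of the rotation-and-audit idea (all names here are illustrative, not the actual service; the real system backs secrets with GCP Secret Manager):

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone

ROTATION_INTERVAL = timedelta(days=30)  # example policy, not the real value


@dataclass
class ManagedCredential:
    """A provider credential tracked by the platform (illustrative only)."""
    name: str
    created_at: datetime
    audit_log: list = field(default_factory=list)

    def rotation_due(self, now: datetime) -> bool:
        # Rotation is scheduled, not manual: a credential past its interval is flagged.
        return now - self.created_at >= ROTATION_INTERVAL

    def record_access(self, principal: str, now: datetime) -> None:
        # Every access is appended to an audit trail before the secret is served.
        self.audit_log.append({"principal": principal, "at": now.isoformat()})


cred = ManagedCredential("openai-api-key", created_at=datetime(2024, 1, 1, tzinfo=timezone.utc))
cred.record_access("team-a@example.com", datetime(2024, 1, 15, tzinfo=timezone.utc))
print(cred.rotation_due(datetime(2024, 2, 5, tzinfo=timezone.utc)))  # True: 35 days old
```

The point of the pattern is that self-service access and governance are not in tension: the audit entry and the rotation check happen on the platform side, so consumers never handle raw lifecycle logic.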

Impact 02

GPU / HPC Platforms Teams Can Use

Built GPU-ready environments with SLURM and GCP Managed Instance Groups, including MIG-based partitioning and scalable patterns for shared workloads, experiments, and distributed execution.
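A toy sketch of what "scalable patterns for shared workloads" looks like in practice: rendering `slurm.conf` node and partition stanzas from an inventory, so cluster shape is data, not hand-edited config. Node names, GRES strings, and sizes below are invented for illustration.

```python
def render_slurm_config(nodes: dict[str, dict]) -> str:
    """Render slurm.conf NodeName and PartitionName lines from a node inventory."""
    lines = []
    for name, spec in sorted(nodes.items()):
        lines.append(
            f"NodeName={name} CPUs={spec['cpus']} RealMemory={spec['mem_mb']} "
            f"Gres=gpu:{spec['gpu_type']}:{spec['gpus']}"
        )
    # One shared GPU partition spanning every node in the inventory.
    node_list = ",".join(sorted(nodes))
    lines.append(f"PartitionName=gpu Nodes={node_list} Default=YES MaxTime=INFINITE State=UP")
    return "\n".join(lines)


inventory = {
    "gpu-node-1": {"cpus": 48, "mem_mb": 192000, "gpu_type": "a100", "gpus": 4},
    "gpu-node-2": {"cpus": 48, "mem_mb": 192000, "gpu_type": "a100", "gpus": 4},
}
print(render_slurm_config(inventory))
```

Generating the config from inventory is what makes instance-group autoscaling workable: when a managed instance group adds or removes nodes, the scheduler configuration is re-rendered rather than patched by hand.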

Impact 03

Infrastructure as Code That Standardizes Delivery

Developed reusable Terraform + Ansible modules and CI/CD patterns that make platform delivery repeatable, reviewable, and easier to operate over time.
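One example of a "reviewable" CI pattern, sketched with invented resource names: a pipeline gate that inspects the JSON form of a Terraform plan (`terraform show -json plan.out`) and flags any resource the plan would delete, so destructive changes require explicit sign-off.

```python
def destructive_changes(plan: dict) -> list[str]:
    """Return addresses of resources a Terraform plan would delete."""
    flagged = []
    for rc in plan.get("resource_changes", []):
        # Each change carries an "actions" list such as ["create"], ["update"],
        # ["delete"], or ["delete", "create"] for a replacement.
        if "delete" in rc["change"]["actions"]:
            flagged.append(rc["address"])
    return flagged


# Minimal stand-in for parsed `terraform show -json` output.
example_plan = {
    "resource_changes": [
        {"address": "google_compute_instance.gpu_node", "change": {"actions": ["delete", "create"]}},
        {"address": "google_storage_bucket.artifacts", "change": {"actions": ["update"]}},
    ]
}
print(destructive_changes(example_plan))  # ['google_compute_instance.gpu_node']
```

Checks like this are what make module reuse safe across teams: the guardrail lives in the pipeline once, instead of in every consumer's head.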

Impact 04

Observability for Reliability and Capacity

Integrated GPU telemetry and health signals (NVIDIA DCGM + Cloud Monitoring) to support SLO-driven operations, faster debugging, and better capacity planning for ML workloads.
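To illustrate the capacity-planning side, here is a toy check over sampled GPU utilization (the kind of values a collector might scrape from DCGM's `DCGM_FI_DEV_GPU_UTIL` field); the GPU names, window, and threshold are all made up for the example.

```python
def underutilized(samples: dict[str, list[float]], threshold: float = 20.0) -> list[str]:
    """Return GPUs whose mean utilization over the window is below threshold (%)."""
    return sorted(
        gpu for gpu, vals in samples.items()
        if vals and sum(vals) / len(vals) < threshold
    )


window = {
    "gpu-0": [95.0, 88.0, 91.0],   # busy training job
    "gpu-1": [3.0, 0.0, 5.0],      # idle; candidate for reclamation
    "gpu-2": [45.0, 60.0, 52.0],
}
print(underutilized(window))  # ['gpu-1']
```

In a real deployment this logic would sit behind a monitoring query or alert policy rather than a script, but the operational question is the same: which accelerators are paid for and not working.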

I lower the barrier to innovation by making “the right way” the easy way.

Strong defaults, automation, and observability reduce friction without sacrificing reliability or trust. When platforms provide clear guardrails and visibility, engineers spend less time fighting infrastructure and more time shipping.

That approach scales — in big-tech environments and regulated domains — because it is fundamentally about operational clarity.

Skills

The core technologies and systems I work with across ML infrastructure, distributed compute, and cloud engineering.

Cloud & Infrastructure

  • Cloud architecture and operations on Google Cloud Platform (GCP)
  • Google Cloud Certified: Cloud Digital Leader
  • Google Cloud Certified: Associate Cloud Engineer
  • Infrastructure as Code and automation (Terraform, Ansible)
  • Compute orchestration with managed instance groups (MIGs) & autoscaling
  • Serverless compute, API management, security services, and traffic management using GCP-native tools
  • Linux systems engineering, networking & scripting (Bash)

ML Platforms & Compute

  • GPU platform management (MIGs, scheduling, monitoring)
  • SLURM cluster provisioning & lifecycle automation (research & production)
  • GPU observability & performance monitoring (NVIDIA DCGM, Cloud Ops)
  • Distributed training & large-scale execution (Ray)
  • Containerization & environment packaging (Docker)

Software & Machine Learning

  • Python as a primary language (tooling, systems, ML)
  • API design & service development (FastAPI, REST)
  • Data storage & querying (BigQuery, MySQL)
  • Internal tools & lightweight UIs (HTML, CSS, React)
  • Deep learning & classical ML (TensorFlow, PyTorch, scikit-learn)
  • Time-series modeling & forecasting (biomedical & operational)
  • Signal processing techniques (PCA, ICA, filtering)

Experience

Software Engineer (ML Infrastructure)

Mayo Clinic · Generative Artificial Intelligence Program

Remote · Jan 2024 to Present

I build and operate ML infrastructure for the Generative AI Program, including GPU compute environments, SLURM automation, and distributed training support on GCP. I developed a GPU observability stack using Cloud Ops and NVIDIA DCGM to improve reliability and diagnostics, and I maintain internal FastAPI services for secure model access and credential management, integrated with GCP Secret Manager and delivered through infrastructure-as-code workflows built on Terraform and Ansible.

Research Engineer (Data Science)

Mayo Clinic · Multimodal Neuroimaging Laboratory

On-site · May 2021 to Jan 2024

I built high-throughput pipelines for diffusion MRI and intracranial EEG processing using Python, Linux, and distributed workflows. I led development of HED SCORE, an open-source EEG metadata framework adopted by international research teams, and designed tools for multimodal data integration, structured metadata management, and quality control across large neuroimaging datasets, working closely with clinicians and data scientists.

Research Engineer (Applied ML)

Mayo Clinic · Bioelectronics Neurophysiology and Engineering Laboratory

On-site · May 2018 to May 2021

I worked on ML for long-duration EEG and wearable sensor data: developing LSTM-based seizure prediction models, integrating real-time ML components into clinical-grade monitoring systems, and building ingestion, preprocessing, and feature extraction pipelines for large biosignal datasets in collaboration with neurology and engineering teams.

Research Technologist

Mayo Clinic · Bioelectronics Neurophysiology and Engineering Laboratory

On-site · May 2016 to May 2018

I prototyped closed-loop neurostimulation systems that combined hardware signals with real-time software control, and engineered backend tools for structured patient data tracking, automation, and biosignal analysis while supporting research teams with software development, data workflows, and database design for experimental studies.

Projects and Publications

Infrastructure and Platform Work

OpenAI API Key Manager

FastAPI · React

Internal service for managing API keys and access to language model providers. Integrates with GCP Secret Manager, includes audit logging and scheduled rotation, and provides a simple admin UI for platform and security teams.

GPU Observability Pipeline

Cloud Ops · NVIDIA DCGM

Unified GPU telemetry stack that collects metrics and health signals from GPU nodes and exposes them in dashboards and alerts. Enables teams to track utilization, debug issues, and plan capacity for ML workloads.

Self-Service SLURM Cluster Provisioning

GCP Managed Instance Groups

Built an automated system for creating and managing SLURM clusters on demand for research teams, using instance templates and infrastructure automation to standardize configuration and simplify lifecycle operations.

Research, Publications, and Open Source

HED SCORE: EEG Annotation Framework

Multimodal Neuroimaging Lab

Machine-readable metadata framework for EEG annotations that supports FAIR-aligned data sharing and large-scale analysis. Designed to make neurophysiology datasets easier to reuse across labs and tools.

View on Google Scholar

Multimodal Neuroimaging Pipelines

Multimodal Neuroimaging Lab

Python-based pipelines for diffusion MRI and intracranial EEG preprocessing and analysis. Built to support reproducible studies and collaboration between engineering and clinical research teams.

Algorithms for Time Series and Signals

Bioelectronics Neurophysiology and Engineering Lab

Development of models and algorithms for long-duration EEG and wearable sensor data, including LSTM-based seizure prediction, anomaly detection, forecasting of biomedical time series, and signal artifact reduction.

View on Google Scholar

Patents

Seizure Forecasting in Wearable Device Data Using ML

US20220359071A1 (Pending)

System for seizure risk forecasting using EEG and wearable data. Combines feature engineering, temporal modeling, and long-horizon prediction for real-world monitoring scenarios.

View on Google Scholar

Let’s Build Reliable AI Platforms

What I Bring

  • Production ML infrastructure and cloud-native platforms on GCP.
  • GPU-ready environments and SLURM-based clusters for distributed workloads.
  • Automation patterns (Terraform/Ansible/CI/CD) that standardize delivery across teams.
  • Observability and reliability for ML systems (DCGM + cloud monitoring).
  • Security and governance practices suitable for regulated environments.

Where I Fit Best

  • Senior ML Infrastructure / ML Platform Engineering.
  • Cloud Platform Engineering for ML + data workloads.
  • Distributed compute + reliability roles (GPU/HPC, orchestration, observability).
  • Teams operating ML platforms in regulated or high-stakes domains.

Areas of Interest

ML Infrastructure · ML Platforms · Cloud-Native ML · GPU and HPC · Observability · Security & Governance · Time-Series ML · Biomedical Data

If you’re building ML infrastructure, platform tooling, or secure AI services — I’d love to connect.

Email Me