Software Engineer · ML Infrastructure and Platforms

Tal Pal Attia

Rochester, Minnesota

Lowering the barrier to innovation through practical ML infrastructure

I am a software engineer at Mayo Clinic IT focused on machine learning infrastructure and internal platforms. I build GPU-ready compute, distributed training environments, and cloud-native services that help teams train and deploy ML workloads reliably.

My work spans Terraform and GCP automation, SLURM-based GPU clusters, observability for ML systems, and secure API services, with additional experience building models and algorithms for time series and biomedical signals. I like working on the infrastructure that makes ML teams faster, safer, and more effective.


ML Infrastructure and Platform Engineering

Software Engineer, Generative AI Program · Mayo Clinic IT

M.Sc. Biomedical Engineering and Physiology · Mayo Clinic Graduate School

B.Sc. Industrial Engineering and Management · Ben Gurion University

Engineering Focus

I focus on core pieces of the ML infrastructure stack: GPU compute, orchestration, pipelines, metadata, and observability. My goal is to make these systems reliable and easy to use so ML and data science teams can ship work faster and with fewer operational surprises.

Focus 01

GPU and Compute Orchestration

Designing and operating GPU-enabled environments using GCP, Managed Instance Groups, SLURM, and MIG-based partitioning. Automating provisioning and configuration to support training, experiments, and shared research workloads.

Focus 02

Cloud and Platform Automation

Using Terraform, Ansible, and CI/CD to build repeatable infrastructure, internal services, and platform components for ML teams. Emphasis on clear patterns, security, and maintainability.

Focus 03

Data, Metadata, and Reproducibility

Building pipelines and schemas that keep datasets, experiments, and artifacts organized, especially for multimodal and time-series workloads. Supporting FAIR principles and reproducible workflows across teams.

Focus 04

Observability and Reliability for ML

Integrating telemetry for GPUs, jobs, and services using NVIDIA DCGM and cloud monitoring. Exposing metrics and health signals that make it easier to debug issues, plan capacity, and keep ML systems healthy over time.

Skills

Core technologies and areas I work with across ML infrastructure, platforms, and cloud engineering.

Cloud & Infrastructure

Cloud architecture and operations on Google Cloud Platform (GCP)
Google Cloud Digital Leader certification
Infrastructure as Code and automation (Terraform, Ansible)
Compute orchestration with managed instance groups (MIGs) & autoscaling
Serverless compute, API management, security services, and traffic management using GCP-native tools
Linux systems engineering, networking & scripting (Bash)

ML Platforms & Compute

GPU platform management (MIGs, scheduling, monitoring)
SLURM cluster provisioning & lifecycle automation (research & production)
GPU observability & performance monitoring (NVIDIA DCGM, Cloud Ops)
Distributed training & large-scale execution (Ray)
Containerization & environment packaging (Docker)

Software & Machine Learning

Python as a primary language (tooling, systems, ML)
API design & service development (FastAPI, REST)
Data storage & querying (BigQuery, MySQL)
Internal tools & lightweight UIs (HTML, CSS, React)
Deep learning & classical ML (TensorFlow, PyTorch, scikit-learn)
Time-series modeling & forecasting (biomedical & operational)
Signal processing techniques (PCA, ICA, filtering)

Experience

Software Engineer (ML Infrastructure)

Mayo Clinic · Generative Artificial Intelligence Program

Remote · Jan 2024 to Present

I build and operate ML infrastructure for the Generative AI Program, including GPU compute environments, SLURM automation, and distributed training support on GCP. I developed a GPU observability stack using Cloud Ops and NVIDIA DCGM to improve reliability and diagnostics. I also maintain internal FastAPI services for secure model access and credential management. These services integrate with GCP Secret Manager and with infrastructure-as-code workflows built on Terraform and Ansible.

Research Engineer (Data Science)

Mayo Clinic · Multimodal Neuroimaging Laboratory

On-site · May 2021 to Jan 2024

I built high-throughput pipelines for diffusion MRI and intracranial EEG processing using Python, Linux, and distributed workflows. I led development of HED SCORE, an open-source EEG metadata framework adopted by international research teams. Working with clinicians and data scientists, I also designed tools for multimodal data integration, structured metadata management, and quality control across large neuroimaging datasets.

Research Engineer (Applied ML)

Mayo Clinic · Bioelectronics Neurophysiology and Engineering Laboratory

On-site · May 2018 to May 2021

I worked on ML for long-duration EEG and wearable sensor data in collaboration with neurology and engineering teams. I developed LSTM-based seizure prediction models, integrated real-time ML components into clinical-grade monitoring systems, and built ingestion, preprocessing, and feature extraction pipelines for large biosignal datasets.

Research Technologist

Mayo Clinic · Bioelectronics Neurophysiology and Engineering Laboratory

On-site · May 2016 to May 2018

I prototyped closed-loop neuromodulation systems that combined hardware signals with real-time software control, and engineered backend tools for structured patient data tracking, automation, and biosignal analysis. I also supported research teams with software development, data workflows, and database design for experimental studies.

Technical Principles

How I think about building infrastructure and platforms for machine learning teams.

Infrastructure as Product

Designing infrastructure and internal tools so they are easy to adopt, documented, and maintainable by teams beyond the original builders.
Favoring clear patterns and automation over one-off solutions.

Strong Defaults and Automation

Using Terraform, Ansible, and CI/CD to enforce consistent environments and configuration.
Automating recurring operational tasks to reduce manual work and error risk.

Observability Before Problems

Adding metrics, logs, and health checks early so issues are easier to detect and debug.
Using NVIDIA DCGM and cloud monitoring to understand GPU and workload behavior.

Security and Compliance Mindset

Working within HIPAA and organizational policies around data, access, and logging.
Designing systems that respect audit, governance, and privacy requirements from the start.

I like making ML systems easier to run, reuse, and trust.

I build infrastructure that teams can understand and rely on without needing to know every internal detail. Good platforms and tools should make it easier to do the right thing by default and lower the barrier to trying new ideas.

In high-stakes domains like healthcare, clarity and explainability matter just as much as performance. Reliable infrastructure helps people do things safely and confidently.

Projects and Publications

Infrastructure and Platform Work

OpenAI API Key Manager

FastAPI · React

Internal service for managing API keys and access to language model providers. Integrates with GCP Secret Manager, includes audit logging and scheduled rotation, and provides a simple admin UI for platform and security teams.

GPU Observability Pipeline

Cloud Ops · NVIDIA DCGM

Unified GPU telemetry stack that collects metrics and health signals from GPU nodes and exposes them in dashboards and alerts. Enables teams to track utilization, debug issues, and plan capacity for ML workloads.

Self-Service SLURM Cluster Provisioning

GCP Managed Instance Groups

Automated system for creating and managing SLURM clusters on demand for research teams. Uses instance templates and infrastructure automation to standardize configuration and simplify lifecycle operations.

Research, Publications, and Open Source

HED SCORE - EEG Annotation Framework

Multimodal Neuroimaging Lab

Machine-readable metadata framework for EEG annotations that supports FAIR-aligned data sharing and large-scale analysis. Designed to make neurophysiology datasets easier to reuse across labs and tools.

Multimodal Neuroimaging Pipelines

Multimodal Neuroimaging Lab

Python-based pipelines for diffusion MRI and intracranial EEG preprocessing and analysis. Built to support reproducible studies and collaboration between engineering and clinical research teams.

Algorithms for Time Series and Signals

Bioelectronics Neurophysiology and Engineering Lab

Development of models and algorithms for long-duration EEG and wearable sensor data, including LSTM-based seizure prediction, anomaly detection, forecasting of biomedical time series, and signal artifact reduction.

Patents

Seizure Forecasting in Wearable Device Data Using ML

US20220359071A1 (Pending)

System for seizure risk forecasting using EEG and wearable data. Combines feature engineering, temporal modeling, and long-horizon prediction for real-world monitoring scenarios.

View on Google Scholar

Links and Profiles

Professional

Connect with me to discuss roles or collaborations related to ML infrastructure, platforms, and cloud-native ML systems.

Open LinkedIn Profile

Research

Google Scholar

Browse my publications and patent work related to seizure prediction, EEG metadata, and multimodal neuroimaging.

View Google Scholar

Exploring What’s Next in ML Infrastructure

What I Bring

Architecting ML infrastructure and cloud-native platforms.
Designing GPU-ready training environments and SLURM clusters.
Developing automation tools that improve workflow efficiency and platform reliability.
Improving observability, reliability, and performance of ML systems.
Building reproducible pipelines, metadata systems, and internal tools.

Roles Where My Experience Fits

Senior ML Infrastructure or ML Platform Engineering roles.
Senior Cloud Engineering roles supporting large-scale ML and data workloads.
Senior engineering roles at the intersection of ML systems, distributed compute, and internal platform development.

Areas of Interest

ML Infrastructure · ML Platforms · Cloud Native ML · GPU and HPC · Time Series ML · Biomedical Data

I am focused on ML infrastructure and cloud ML engineering, and I am open to hearing from teams in these areas where my experience may be a good fit.

Email Me