Maintaining feature pipelines with automated resolution of compute failures is critical to giving customers who rely on Tecton uninterrupted access to machine learning features. The pipelines that compute these features grow more complex as the number of subsystems, use cases, and customers increases, which widens the range of possible failure scenarios. Transient failures can be resolved automatically, while permanent failures require manual intervention. To triage failures, Tecton employs a hybrid approach that combines automatic and manual recovery steps: it increases the delay between retries, caps the total number of retries, and alerts users when the cap is reached. The system also keeps a history of compute jobs so that users can identify potential prevention measures, such as changing data access permissions or tweaking ACL policies. In practice, the system has been deployed with customers, allowing them to intervene quickly and correct issues without requiring support from Tecton's team.
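
The retry policy described above (growing delays, a cap on retries, an alert once the cap is hit, and a per-job history) can be illustrated with a minimal Python sketch. This is not Tecton's implementation: the function name `run_with_retries`, its parameters, the doubling backoff factor, and the `alert` callback are all hypothetical and chosen only for illustration.

```python
import time
from typing import Any, Callable, List, Tuple


def run_with_retries(
    job: Callable[[], Any],
    max_retries: int = 3,          # assumed cap on retries after the first attempt
    base_delay: float = 1.0,       # assumed initial delay in seconds
    backoff: float = 2.0,          # assumed multiplier applied after each failure
    alert: Callable[[str], None] = print,          # hypothetical alerting hook
    sleep: Callable[[float], None] = time.sleep,   # injectable for testing
) -> Tuple[Any, List[str]]:
    """Run `job`, retrying failures with increasing delays.

    Returns (result, history). The result is None if every attempt failed.
    The history holds one entry per attempt, so a user can inspect what
    happened before deciding how to intervene.
    """
    history: List[str] = []
    delay = base_delay

    # One initial attempt plus up to `max_retries` retries.
    for attempt in range(max_retries + 1):
        try:
            result = job()
            history.append(f"attempt {attempt + 1}: success")
            return result, history
        except Exception as exc:
            history.append(f"attempt {attempt + 1}: failed: {exc}")
            if attempt == max_retries:
                # Cap reached: stop retrying and hand over to a human.
                alert(f"job failed after {max_retries} retries: {exc}")
                return None, history
            sleep(delay)
            delay *= backoff  # wait longer before the next retry

    return None, history  # only reachable if max_retries is negative
```

A transient failure (for example, a timeout that clears on its own) would succeed on a later attempt without anyone being notified, while a permanent failure exhausts the cap and triggers the alert, at which point the recorded history gives the user a starting point for manual recovery.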