Making Semgrep rip: How Ripgrep inspired us to shave hours off (some) scans
Blog post from Semgrep
Before scanning, Semgrep runs a file targeting step that filters candidate files against ignore patterns. Semgrep supports ignore patterns from several sources, including .gitignore and .semgrepignore, which let users customize scan findings and shorten scans by focusing on relevant files. On large repositories this step was badly inefficient: it made millions of regex calls and in some cases took hours.

Inspired by Ripgrep, Semgrep adopted optimized matching strategies that replace most regex lookups with simple string comparisons backed by a hash table index, so far fewer patterns need a regex call. For one customer's repository, scan time dropped from 7.5 hours to under 2 minutes. Across Semgrep's customer base, the 99th percentile scan duration fell from nearly an hour to under 12 minutes, with the largest speedups on the most time-consuming scans, making scanning more efficient and reliable overall. The changes are available in Semgrep 1.162.0.
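
To make the idea concrete, here is a minimal Python sketch of the general technique: classify each ignore pattern once, answer the common cases with set lookups, and fall back to a regex only for patterns that need one. It is an illustration under stated assumptions, not Semgrep's implementation. The `IgnoreIndex` class, its three fast-path categories, and the `regex_calls` counter are hypothetical names chosen for this example, and the sketch ignores real gitignore features such as negation, anchoring, and directory-only patterns.

```python
import fnmatch
import re

GLOB_META = set("*?[\\")  # characters that make a pattern non-literal


def has_meta(text):
    """Return True if the text contains any glob metacharacter."""
    return any(ch in GLOB_META for ch in text)


class IgnoreIndex:
    """Hypothetical index: hash lookups first, regex only as a fallback."""

    def __init__(self, patterns):
        self.literal_paths = set()  # e.g. "docs/build.txt"
        self.names = set()          # e.g. "node_modules"
        self.extensions = set()     # e.g. ".log" (from "*.log")
        self.regexes = []           # everything else
        self.regex_calls = 0        # counts fallback regex evaluations
        for pattern in patterns:
            self._add(pattern)

    def _add(self, pattern):
        rest = pattern[2:]
        if not has_meta(pattern):
            # Plain string: full path if it has a slash, else a name.
            target = self.literal_paths if "/" in pattern else self.names
            target.add(pattern)
        elif (pattern.startswith("*.") and rest and not has_meta(rest)
              and "/" not in rest and "." not in rest):
            # "*.ext" reduces to a suffix check on the file name.
            self.extensions.add(pattern[1:])
        else:
            # Simplification: fnmatch lets "*" cross "/" boundaries,
            # which differs from gitignore semantics.
            self.regexes.append(re.compile(fnmatch.translate(pattern)))

    def is_ignored(self, path):
        if path in self.literal_paths:
            return True
        components = path.split("/")
        # Any path component equal to an ignored name ignores the file.
        if not self.names.isdisjoint(components):
            return True
        base = components[-1]
        dot = base.rfind(".")
        if dot >= 0 and base[dot:] in self.extensions:
            return True
        for regex in self.regexes:
            self.regex_calls += 1
            if regex.match(path):
                return True
        return False


index = IgnoreIndex(["node_modules", "*.log", "docs/build.txt",
                     "src/**/test_*.py"])
print(index.is_ignored("node_modules/pkg/index.js"))
print(index.is_ignored("app/debug.log"))
print(index.regex_calls)
```

With this layout, the cost of checking a file against the fast-path patterns does not grow with the number of such patterns, since each check is a constant number of hash lookups. Only the patterns that end up in `regexes` are still evaluated one by one.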