The properties of sparse L1 pursuits were initially presented in a flurry of papers by David Donoho and his former student Emmanuel Candès. Candès had done an experiment in which he attempted to fit random signals with a combination of Fourier series and impulse spikes. The results were far better than expected and led both of them to an intense effort spanning most of 2004.
The best summary of the importance of their work is stated in the introduction to this paper:
https://statweb.stanford.edu/~donoho/Reports/2004/l1l0EquivCorrected.pdf

Unless you enjoy mathematics a lot, I suggest reading just the introduction. Why it works is not as important as the fact that it does work and that software for applying the technique is readily available.
What Donoho proves in the paper cited above is that, under conditions made precise below, the least summed absolute error (L1) solution to an underdetermined system of linear equations Ax = y is exactly the optimal sparsest (L0) solution. Finding the L0 solution directly is NP-hard: previously, the only known way to solve such problems was to try every possible combination of the columns of A. For problems of significant size that is impossible; on the fastest computer available it would take until all the stars in the universe burned out.
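In symbols, the two problems look like this (my notation, not the paper's):

    \min_x \|x\|_0 \quad \text{subject to} \quad Ax = y    (count of nonzeros; exhaustive search, NP-hard)
    \min_x \|x\|_1 \quad \text{subject to} \quad Ax = y    (sum of absolute values; a linear program)

Donoho's result is that when x is sparse enough and the columns of A are sufficiently uncorrelated, both problems have the same unique solution.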
This has profound implications for a great many applications. In the context of metrology, it permits solving problems which account for all the known and characterized system errors. In the case of errors due to quantum effects which can only be described statistically, it provides tight bounds on the irreducible error.
So if we parameterize the system output of a voltage reference and the associated buffer amplifier with an expression such as:
expected_value = vref_initial_value + vref_aging + vref_1/f_noise + vref_temp
               + amp_initial_offset + amp_aging + amp_temp
               + r1_initial + r1_aging + r1_temp
               + r2_initial + r2_aging + r2_temp
               + r3_initial + r3_aging + r3_temp
Given a reasonable approximation of the values in the summation as arbitrary functions of measured parameters and unknown coefficients, we can compute the functions for a large number of possible coefficients over a wide range. A problem with 10,000 choices for each term is solvable in a matter of a few minutes by means of linear programming using the simplex method. Faster algorithms based on the properties of regular polytopes in N-dimensional space exist, but the simplex method, invented by George Dantzig to solve optimization problems in operations research, does an excellent job, and high quality FOSS software is readily available.
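To make the mechanics concrete, here is a small sketch of the L1 fit posed as a linear program and handed to SciPy's linprog (toy sizes and synthetic data; the variable names are mine):

    import numpy as np
    from scipy.optimize import linprog

    rng = np.random.default_rng(0)

    # Toy dictionary: 40 measurements, 120 candidate columns (underdetermined).
    m, n = 40, 120
    A = rng.standard_normal((m, n))
    A /= np.linalg.norm(A, axis=0)      # normalize columns

    # Ground truth: only 4 of the 120 coefficients are nonzero (sparse).
    x_true = np.zeros(n)
    x_true[rng.choice(n, size=4, replace=False)] = rng.standard_normal(4)
    y = A @ x_true

    # L1 minimization as an LP: write x = u - v with u, v >= 0 and
    # minimize sum(u) + sum(v) subject to A(u - v) = y.
    c = np.ones(2 * n)
    A_eq = np.hstack([A, -A])
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=(0, None))
    x_l1 = res.x[:n] - res.x[n:]

    print("max recovery error:", np.abs(x_l1 - x_true).max())

Recent SciPy solves this with the HiGHS backend, which includes a dual simplex solver, but any LP code will do: the reformulation above is a completely ordinary linear program.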
If one computes 10,000 possible instances for each of the terms above as functions of measured values such as temperature, age, etc., there are 10,000**16 possible combinations. An L0 solution is computationally intractable; it can't be done. But the L1 solution has been proven to be identical to the L0 solution provided the columns of A are uncorrelated in all combinations and the solution vector x of Ax = y is sparse, that is, most of the coefficients of x are zero.
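One standard way to make "uncorrelated in all combinations" precise is the mutual coherence of the dictionary, the largest normalized inner product between any two columns (this bound comes from Donoho and Elad's work in the same series of papers):

    \mu(A) = \max_{i \neq j} \frac{|a_i^T a_j|}{\|a_i\|_2 \|a_j\|_2}

    \|x\|_0 < \frac{1}{2}\left(1 + \frac{1}{\mu(A)}\right) \implies \text{L1 finds the unique sparsest solution}

So the less correlated the candidate columns, the more nonzero terms the method can tolerate while still guaranteeing the exact answer.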
To put this in concrete terms, suppose we know from testing that the value of a resistor is f(a,b,c) where:
a = initial value
b = change due to ambient temperature
c = change due to current flowing through the resistor
but that we do not know a, b or c. All we know is that the change due to ambient temperature is some function g(b) and the change due to heating by the current is some function h(c). If we know the form of h(c), say a polynomial h(c) = k0 + k1*c + k2*((c-d)**2) from measuring a number of resistors, where k0, k1 and k2 (and the offset d) vary from device to device, then we can evaluate h(c) for a large number of combinations of k0, k1 and k2 over their expected ranges. All of these are added to the A matrix as possible answers. This collection of candidate columns is the "dictionary", in the terminology of sparse pursuits.
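As a sketch, building the dictionary columns for just the self-heating term might look like this (the grids, ranges and the fixed d are illustrative choices, not measured values):

    import numpy as np
    from itertools import product

    # Currents at which the resistor was actually measured (illustrative).
    c = np.linspace(0.0, 0.1, 50)             # amperes, one entry per measurement

    # Candidate coefficient grids spanning the expected device-to-device spread.
    k0_grid = np.linspace(-1e-3, 1e-3, 10)
    k1_grid = np.linspace(0.0, 5e-2, 10)
    k2_grid = np.linspace(0.0, 2.0, 10)
    d = 0.05                                  # offset, held fixed here for brevity

    # One dictionary column per (k0, k1, k2) combination: h(c) evaluated at
    # every measurement point gives 10*10*10 = 1000 candidate columns.
    columns = [k0 + k1 * c + k2 * (c - d) ** 2
               for k0, k1, k2 in product(k0_grid, k1_grid, k2_grid)]
    H = np.column_stack(columns)
    print(H.shape)                            # (50, 1000)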
So what one does in practice is create models of the functional forms of the various terms. One then builds a massive dictionary of all the possible values for each term as a function of the known values. Unknown values are simply varied over a range of expected values and the expression evaluated. In the example cited, with 10,000 instances for each term the A matrix would consist of 160,000 columns and as many rows as one had measurements. When I was in grad school 30 years ago and our computers consisted of an 11/780 and a MicroVAX II, solving such a problem was not possible even if we had known how. Now it's no harder than computing a long FFT on a desktop machine.
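The full A matrix is then just the per-term dictionaries stacked side by side (a sketch continuing the example above; the random arrays are stand-ins for real per-term dictionaries like H):

    import numpy as np

    # Each entry plays the role of one term's dictionary, shaped
    # (n_measurements, n_candidates). Random stand-ins for illustration.
    rng = np.random.default_rng(1)
    term_dicts = [rng.standard_normal((50, 1000)) for _ in range(3)]

    A = np.hstack(term_dicts)                 # 50 x 3000 here; 160,000 wide in the example above
    A /= np.linalg.norm(A, axis=0)            # normalize so no column dominates the L1 cost

    # This A, together with the vector y of measured outputs, feeds the
    # same linprog formulation shown earlier.
    print(A.shape)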