BIOMETRICS 55, 37-43 March 1999 A Model for Thermograms Sthphane Robin INA-PG/INRA, D6partement de Biometrie, 16, rue Claude Bernard, 75231 Paris cedex 05, France ernail:

[email protected] SUMMARY. Thermograms are curves resulting from thermal analysis and are of great interest in the study of various food and biological products physical properties. A method to separate underlying peaks is proposed, and statistical properties of estimates for some characteristic parameters are derived. The total number of peaks can be estimated with a sequential analysis of the residual plots. For each new peak, a statistical criterion is proposed to check whether it is significantly different from the noise of the recording. As an example, the method is applied to a summer milk fat fusion thermogram. KEY WORDS: Nonlinear regression; Peaks separation; Thermal analysis. 1. Introduction Thermal analysis summarizes information on the structural and rheological behavior of biological and food products as a function of the temperature. The results of this method are generally presented as a curve called a thermogram (or differential scanning calorimetry [DSC] recording) in which the horizontal axis value is the temperature and the vertical axis value is the enthalpy. A thermogram is essentially composed of several peaks, each representing a component of the considered product. The location of a peak, identifies the underlying component, and the area gives its proportion. A reliable separation method of the peaks in therefore necessary to derive satisfying quantita- tive information on each component from thermal analysis. Analysis of thermograms has remained essentially empir- ical because of the lack of statistical methods and models. The problem of the estimation of the areas is especially im- portant and very poorly solved with the common empirical methods. For complex thermograms with numerous or over- lapping peaks, as observed with milk fat DSC, even the loca- tion parameters are very inaccurately estimated. Ritter (1994) investigates a similar problem for curves ob- tained by electron spectroscopy. He proposes a Gaussian- Lorentzian model with a fixed number of peaks and uses a Bayesian estimation method to incorporate some auxiliary information in the model. In the end, he discusses the con- sequences of ignoring the prior information on the quality of estimates. The model proposed here allows separation of numerous peaks and gives very good approximation to the whole ther- mogram. Figure 1 shows an example of its application. Such a decomposition is of great help in inferring the composition of the product. Note that the model presented here is not a mixture model. The literature on mixtures is abundant (see, e.g., Roeder, 1996, or the general literature on classification) but is not suitable for our problem because we do not have a random sample issued from a mixture distribution. The data we deal with are a sequence of regularly spaced abscissas with corre- sponding random ordinates, so we use a nonlinear regression model. Although the model uses Gaussian peaks, it has noth- ing to do with any probability density function. Several models for a single peak are presented in Section 2, and a choice is made among them. The general model is then constructed when the number of peaks is known. The different steps of the parameters estimation are exposed in Section 3, and the statistical properties of the estimates are derived. A sequential method to determine the number of peaks is then proposed in Section 4, and Section 5 presents an application to a summer milk fat stearic fraction. 2. The Model 2.1 Model for One Peak Peak shape. The physical theory suggests that a peak is made of a linearly ascending part starting from a point called onset and an exponentially decreasing part. This description gives the so-called linear-exponential model h(t - o)/(s -- o) h exp{ -u(t - s ) } if t 5 s, otherwise, v(t; 0) = where 0 is the vector containing the characteristic parameters of the peak. Here 8 = (0, s, h, u) , where h denotes the height (i.e., the maximum) of the peak, s (summit) the temperature corresponding to this maximum, o the temperature of the onset, and u the exponential return rate to the baseline. In this model, the area of the peak is a = h{(s - o)/2 + u- }. 37 38 Biometrics, March 1999 C 0 r r e C t d e n e r -0.05 , , . . . . . . I . . . . , . . . . , . . . . , . . . . I . . . . I . . . . , . . . . I - 3 0 - 2 0 -10 0 10 2 0 3 0 40 5 0 60 Temperature legend: observed data (dots), fitted curve (solid line) and peaks (dashed lines) Figure 1. Eight-peak decomposition plot. Before proposing any other model, we will remark that, according to the physical theory, four seems to be the good number of parameters for one peak because the model must give the place where the peak stands (i.e., the location param- eter), its height, and the increasing and decreasing rates. A model with less parameters would necessarily imply assump- tions on the shape of the peak, such as symmetry. Another possible model is the Gaussian-exponential model, hexp{-A2(t - sl2/2) h exp{ -v(t - s ) } if t 5 s, otherwise. p(t; 6) = where 6 = (A, s, h,y) and y represents the thickness of the descending part. The different shapes obtained with these models can be seen on Figure 2. As we shall see in the following, the rupture at the top of the peak observed with the first two models is not very realistic. Comparison of the different shapes. To explain the choice of the single peak function p (-; 6), we compare the different models in the case of a pure product which is a reference among fatty acids: the lauric acid. Figure 2 presents the fitted curves obtained with the three models and shows that the hypothesis of a linear increasing part is a bad approximation, likewise the exponential decreas- Here 6 = (A, s, h, v ) where X represents the thickness of the ascending part of the peak: in this model, as observed on real thermograms, there is no discontinuity of the slope at the onset. ~~~~~. The last model proposed here is the Gaussian-Gaussian ing part. The different models are compared according to their residual sums of squares and estimated variances of errors (SSE and s2: precisely defined in next section). We see that the Gaussian-Gaussian model is more than three times better model, hexp{-A2(t - sl2/2) hexp{-y2(t - ~ ) ~ / 2 } otherwise, if t 5 s, p(t; 6) = Linear-exponent ial Gaussian-exponential Gaussian-Gaussian 4~~ 3 0 5 0 ~~ : IA 20 1 0 0 4 2 4 3 4 4 45 46 4 1 92 4 3 4 4 05 4 6 4 7 4 2 43 4 4 4 5 4 6 4 1 T T T SSE = 514.4 SSE = 347.3 SSE = 155.1 z2 = 0.865 C2 = 0.584 ii2 = 0.261 legend: observed data (dots), fitted curve (solid line) Figure 2. Model for the single peak of pure lauric acid. A Model for Themnograms 39 than the linear-exponential model and more than twice bet- ter than the Gaussian-exponential model. Many other exam- ples give similar results, we shall therefore use the Gaussian- Gaussian model in the following. 2.2 General Model We study here a nonlinear regression model for a thermogram with K peaks (K being known), Y(t; w ) = @(t; w) + E(t), where t is the temperature, w is the set of parameters, and the error terms E(t) are supposed to be independent and normally distributed. Baseline. During fusion, the properties of a product are modified because its components melt one after another. This modification implies that the slope of the baseline (which is related to the specific heat [C,] of the product) changes all along the thermogram. In common empirical methods, this baseline b(t) is approximated by a broken line, but this method gives bad results at the breaking points. We use here an approximated baseline made of two lines and a third-degree spline polynomial joining them: amint + cmin bo + blt + bzt2 + b3t3 if t tmax. The points tmin and tmax are chosen in the linear parts of the thermogram, respectively at the beginning and at the end. Coefficients (amin, c,in) and (amax, Cmax) are the slopes and intercept of the respective tangents in tmin and tmax and (bo, bi, b2, b3) are related to (amin, Cmin, amax, cmax) by the usual relations, ensuring the continuity of b(t) and b’(t). This expression gives a smoother (and more realistic) shape to the baseline. The introduction of the baseline in the model is briefly discussed in the conclusion. In the following, we consider that the baseline has been removed from the thermogram, so we study here the corrected enthalpy Y(t) + Y(t) - b(t). Model for K peaks. The particularity of this model is that the function @(.;w) is the sum of K functions cp(.;ek) each corresponding to one peak: K w is the set of parameters describing the whole thermogram. The underlying physical hypothesis is that the thermogram is the simple sum of the K peaks, each corresponding to one component. We will see in the following that, despite its simplicity, such a model gives very good results. Notice that this model is identifiable, as are Gaussian mixtures: There is a unique parameter w giving the curve y = @ (t;w) for all t. 3. Parameter Estimation In the following, for any parameter +, we denote + its true value, $ J its evaluation with an empirical method, +* thz starting value of the nonlinear regression algorithm, and + its maximum likelihood estimator. - 3.1 Nonlinear Regression: Estimation of the 81, ’s Let C be the curve to be modeled and K the number of peaks. C is made of n points: C = {(ti, yi), i = 1,. . . , n}. We use the maximum likelihood estimator, which, under the normal hypothesis, is the same as the least-squares estimator: i; minimizes SSE(u) = c ( y i - @(ti, u))~ i All computations are made with the NLIN procedure of the SAS software, generally with the Marquardt algorithm. This algorithm needs the derivatives of @(t;w) according to each element of w for any value of t. These derivatives have rather simple forms, except at the summits, where the derivatives according to Sk can be infinite, so we must assume that no point ti of the observed curve stands at the exact x value sk of the rupture of any peak. Data transformation. To avoid heteroscedasticity for E(t), we estimate the regression model Z(t) = log (1 + a (t; w)} + E(t), where Z(t) = log{l + Y ( t ) } instead of modeling directly Y(t); notice that the constant 1 is chosen arbitrarily. This transformation is coherent with the hypothesis that the variance of the error term E(t) increases with the signal Y(t) (see Seber and Wild, 1989; Huet, Jolivet, and MessBan, 1992). Confidence intervals. The nonlinear least-squares estima- tor is asymptotically unbiased and normal (see Gallant, 1987). Denoting as the estimator of w, we have fi(G - w) 2 N (0, I ( w ) - V ) , as n + +m, where I(w) is the Fisher’s information matrix and approximately (G - W ) - N (0, (F(w)’ . F(w)) ,-l 02) , where the general term of F(w) is Fik(w) =

[email protected](t,; w)/dwk. This giv? the Triances of the maximum likelihood estimators i;))’, give asymptotic confidence intervals for each parameter. = 3.2 Initialization ANATHERM software. The ANATHERM (thermograms analysis) software has been developed to exploit the results of the many differential thermic analyses of anhydrous milk fat. Here, this software is used to obtain the starting values w* needed by the nonlinear estimation algorithm. 40 Baometrics, ANATHERM computes characteristic elements of a curve, such as its derivative at any point, the greatest and smallest derivatives on a given interval, the area under the curve between two given points, and the x and y values of the maximum of the curve in a given interval. The derivatives are estimated with a weighted least-squares regression, and the weight function is user defined (Robin, Bourguinon, Lavigne, Ollivon, unpublished manuscript .) . Initial values. For each peak, the abscissa of the onset (6) and of the maximum ( S ) , the ordinate of the maximum O), they must be estimated using x, g, %, and T (index k is omitted for simplicity in notation). - Onset. According to physical practice, onset has been defined as the intersection of the greatest ascending tangent with the baseline. A Gaussian peak has no real onset, but this parameter is very useful in locating the significant beginning of the peak. The greatest ascending tangent is reached when d2’p(t; 8)/at2 = 0, i.e., when tasc = s- A-’; at this point, the slope of the tangent is dp(t; @)/at I t,,,= m h X - l , and this tangent intercepts the baseline at the abscissa o = s - 2X-l. The natural estimator of the onset is then hh - 0 = s - 2x-1 Area. The theoretical area value of a peak is a = J p(t; 8)dt = m h ( X - l + y-l) so that a natural estimator of the area is March 1999 Bias and variance approximation of z and 2 . The biases and variances of these estimators can be approximated with the A method using the second-order Taylor expansion of z and 2 about (A, s, h, y). The first- and second-order terms of the expansion give, respectively, the variance and the bias (see the Appendix). We must note that nonlinear least-squares estimators are only asymptotically unbiased and can be biased for finite samples, even for rather simple models, as shown in Box (1971). Therefore, the variances and biases expressions given in the Appendix are only approximations. In the example in Section 5, as in many others, the approximate biases of both and 2 remain negligible (see Table 1). Other possible parametrizations. The model could have been parametrized with the respective inverses A’ and y’ of X and y to obtain simpler forms for both a and o (a = h m ( X ’ + y’), o = s - 2X’); then $would be asymptotically unbiased and the variance and bias of 2 calculated with the A method would have much simpler forms. Despite the fact that this parametrization seems to be equivalent to the previous one, it leads to numerical instabilities, especially in case of very thin peaks, i.e., when A’ or y’ is very small. Another way of writing the model is to introduce directly the parameters a and o using X = 2(s - o)-’ and y = [ a f i + h(s - 0)/2]-l, but this parametrization also leads to condition number problems and to very slow convergence of the algorithm. The first parametrization is therefore maintained, and both o and a are estimated with the expressions of z and 2 given previously. 4. Determination of the Number of Peaks Iterative method. The initialization step shown previously is generally insufficient to find all peaks in complex thermograms because K is unknown. An easy way to exhibit new peaks is to study residual plots. After a K-peak regression model has been fitted to the curve, the residual plot z(K)(t) = y(t) - @(t, G(K)) can be analyzed as a fictitious thermogram in which new peaks appear. The initial value 8 13.41 p.119; 0.1211 [3.44; 3.731 10.99; 1.031 A A h A h = 0.120 o = 3.61 - - Estimated bias 1.05 10-~ 3.80 10-~ A Model for Thermograms 41 C 0 r r e C t s e n e r 4 Y 0 . 0 5 . . . . . . . . . . . . . . . . . . . .