Stability-based validation of dietary patterns obtained by cluster analysis

Background: Cluster analysis is a data-driven method used to create clusters of individuals sharing similar dietary habits. However, this method requires specific choices from the user, which influence the results. There is therefore a need for an objective methodology to help researchers make these decisions during cluster analysis. The objective of this study was to use such a methodology, based on the stability of clustering solutions, to select the most appropriate clustering method and number of clusters for describing dietary patterns in the NESCAV study (Nutrition, Environment and Cardiovascular Health), a large population-based cross-sectional study in the Greater Region (N = 2298).
Methods: Clustering solutions were obtained with K-means, K-medians and Ward’s method, with the number of clusters varying from 2 to 6. Their stability was assessed with three indices: the adjusted Rand index, Cramer’s V and the misclassification rate.
Results: The most stable solution was obtained with the K-means method and 3 clusters. The “Convenient” cluster, characterized by the consumption of convenient foods, was the most prevalent, with 46% of the population having this dietary behaviour. In addition, “Prudent” and “Non-Prudent” patterns, associated respectively with healthy and non-healthy dietary habits, were adopted by 29% and 25% of the population. The “Convenient” and “Non-Prudent” clusters were associated with higher cardiovascular risk, whereas the “Prudent” pattern was associated with a decreased cardiovascular risk. Associations with other factors showed that the choice of a specific dietary pattern is part of a wider lifestyle profile.
Conclusion: This study is of interest for both researchers and public health professionals. From a methodological standpoint, we showed that using the stability of clustering solutions could help researchers in their choices. From a public health perspective, this study showed the need for targeted health promotion campaigns describing the benefits of healthy dietary patterns.
Electronic supplementary material: The online version of this article (doi:10.1186/s12937-017-0226-9) contains supplementary material, which is available to authorized users.


Background
In recent years, the dietary patterns (DP) approach has been used extensively to describe overall eating patterns in populations. In the literature, the most widely used methods for deriving dietary patterns are cluster analysis (CA) and principal component analysis (PCA). However, the two methods describe diet in quite different ways. In PCA, continuous factors are defined based on correlations between dietary intakes, and each individual has a score on every derived factor [1]. As a result, an individual's DP is difficult to interpret, since it is described by scores on several factors [2]. Cluster analysis, on the other hand, separates individuals into mutually exclusive groups (clusters) based on similarities between their diets. Compared with factors, individual DPs are easier to interpret since each individual is assigned to one cluster only.
One major challenge in using cluster analysis is that the obtained solution strongly depends upon the choices made by the investigator. Among them, the choice of the clustering method and of the number of clusters is particularly important [3]. Indeed, since different clustering methods make different assumptions about the structure of the data, the method should be chosen according to the expected group structure. However, researchers usually have no prior knowledge about the structure of the clusters or their number. As a result, researchers tend to run different clustering methods with different numbers of clusters and to present the most interpretable solution [1,2,4]. This solution may not be the one that best represents dietary patterns in the population. As an alternative, some studies used indices measuring distances between clusters [3,5-7]. However, since those indices assume a group structure, their use should be avoided when the group structure is unknown [8].
Consequently, researchers need a method allowing objective selection of the most appropriate clustering method and number of clusters for describing their data. Lange et al. introduced an objective criterion to compare different clustering solutions and to choose the most appropriate [8]. This criterion measures the goodness of clustering solutions by assessing their stability. A stable clustering solution should be similar to solutions computed on other data sets drawn from the same source. The idea is that clustering solutions exhibiting higher stability are likely to be more appropriate for describing the data.
Therefore, the primary objective of this study was to test such an objective procedure for selecting the optimal clustering method and number of clusters describing dietary patterns, based on data from the interregional, cross-sectional, population-based NESCaV study (Nutrition, Environment and Cardiovascular Health). For simplicity, we limited its application to the traditional clustering methods used in the field of dietary pattern analysis, namely K-means, K-medians and Ward's minimum variance. Secondly, the selected clustering solution was described and its relationships with nutrient intakes, socio-demographic, lifestyle and cardiovascular risk factors (CVRF) were presented. Finally, a comparison was made with PCA factors.

Methods
Details concerning the NESCAV study have been presented previously [9-11]. Briefly, it is the first cross-border population-based cardiovascular health study, based on a stratified random sample of 3133 subjects, aged 18-69 years, recruited from three neighbouring regions, namely the Grand-Duchy of Luxembourg, Wallonia in Belgium, and Lorraine in France, which constitute an important segment of the Greater Region population. Recruitment took place from 2007 to 2008 in the Grand-Duchy of Luxembourg and from 2010 to 2011 in Wallonia and Lorraine. Pregnant women, people living in institutions, subjects outside the age range 18-69 years and those deceased before recruitment were excluded [10]. Sample sizes were computed so that the prevalence of cardiovascular risk factors could be estimated with a 95% confidence level and a precision of 1%.
A 134-item food frequency questionnaire (FFQ) was used to assess dietary intakes. The description and validation of this questionnaire have been detailed elsewhere [12,13]. To facilitate the analysis, the 134 food items were merged into 45 broader food groups according to their similarities (unpublished observations). Daily food intakes were computed as the product of the daily frequency of consumption and the amount consumed.
The cardiovascular risk factors (CVRF) considered were body mass index (BMI, kg/m²), waist to hip ratio (WHR), systolic blood pressure (SBP, mmHg), diastolic blood pressure (DBP, mmHg), fasting plasma glucose (FPG, mg/dl), glycated haemoglobin (HbA1c, %), low-density lipoprotein cholesterol (LDL, mg/dl), high-density lipoprotein cholesterol (HDL, mg/dl) and triglycerides (TG, mg/dl). Information on treatment for hypertension, diabetes and dyslipidaemia was also gathered. The lifestyle behaviours collected were smoking status and level of physical activity, expressed as weekly energy expenditure in metabolic equivalent task minutes per week (METs min/week) and based on self-reported data from the International Physical Activity Questionnaire (IPAQ) [14,15].
Specific inclusion criteria were also defined for this particular study. A flowchart of the participants who met the inclusion criteria is provided (see Additional file 1: Figure S1). First, 138 participants with unreliable reporting in the FFQ or outlying values on nutrient intakes were excluded. Then, since relationships between dietary habits and CVRF may be biased by participants who had a serious cardiovascular event (n = 327) and/or who were on a diet (n = 312), those individuals were excluded. In addition, participants who were not fasting at the time of blood collection (n = 58) were also discarded. Thus, the final sample comprised 2298 individuals.

Statistical analysis
Transformation of the data
Firstly, food group and nutrient intakes were adjusted for energy intake using the residual method of Willett and Stampfer [16]. Secondly, since extreme values may have a significant effect on clustering solutions, extreme intakes above six standard deviations were truncated [17]. Of the 103,410 available intakes (2298 individuals × 45 food groups), only 294 (0.28%) were truncated. Thirdly, since food intakes with large scales tend to have a larger effect on clustering solutions, food intakes were standardized by subtracting the minimum intake and then dividing by the range [3,18].
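The three transformation steps can be illustrated with a minimal Python sketch. It is not the authors' SAS code: the function names and the synthetic data are hypothetical, the residual method is written as a simple linear regression of intake on total energy, and "truncated" is assumed to mean capping values at the mean plus six standard deviations.

```python
import numpy as np

def energy_adjust(intake, energy):
    """Residual method: regress intake on total energy and keep the residuals,
    re-centred on the intake predicted at the mean energy intake."""
    slope, intercept = np.polyfit(energy, intake, 1)
    residuals = intake - (intercept + slope * energy)
    return residuals + (intercept + slope * energy.mean())

def truncate_extremes(x, n_sd=6.0):
    """Assumption: cap values lying more than n_sd SDs above the mean."""
    return np.minimum(x, x.mean() + n_sd * x.std(ddof=1))

def range_standardize(x):
    """Subtract the minimum and divide by the range, giving values in [0, 1]."""
    return (x - x.min()) / (x.max() - x.min())

# Synthetic example for one food group (values are illustrative only).
rng = np.random.default_rng(0)
energy = rng.normal(2200, 500, size=500)                  # kcal/day
intake = 0.05 * energy + rng.gamma(2.0, 20.0, size=500)   # g/day
prepared = range_standardize(truncate_extremes(energy_adjust(intake, energy)))
```

Applied column by column to the 45 food groups, this would yield the standardized input matrix used for clustering.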

Formalization of cluster analysis
Let $X = (X_1, \dots, X_n)$ be the dataset of $n = 2298$ individuals to be clustered, where $X_i$ is a 45-dimensional vector containing the 45 standardized food group intakes of the $i$-th individual. A clustering algorithm $A$ with a predefined number of clusters $k$ constructs a solution $Y$ that partitions the dataset $X$ into $k$ clusters ($Y := A_k(X)$). This solution $Y$ is represented by an $n$-dimensional vector of labels $Y = (Y_1, \dots, Y_n)$, where $Y_i \in \{1, \dots, k\}$ is the cluster assigned to the $i$-th individual.

The measure of stability
Cluster stability exploits the fact that when multiple datasets are sampled from the same distribution, the clustering algorithm is expected to behave in the same way and produce similar results. Based on this idea, Lange et al. introduced a stability measure computed on the comparison of solutions obtained on different datasets drawn from the same source [8]. This stability measure was then compared across clustering methods and numbers of clusters to select the model associated with the most stable solution. Since this concept and its use in practice were previously described in detail [8], the method is only summarized below.
Briefly, considering a solution $Y := A_k(X)$, the method consists in assessing its stability by randomly splitting the data $X$ into two independent halves, $X_{tr}$ (training dataset) and $X_{te}$ (test dataset), and comparing the solutions obtained for these halves ($Y_{tr} := A_k(X_{tr})$ and $Y_{te} := A_k(X_{te})$). However, since the datasets $X_{tr}$ and $X_{te}$ are disjoint, the clustering solutions are not directly comparable. To make them comparable, a solution transfer mechanism extends the clustering solution $Y_{tr}$ of the dataset $X_{tr}$ to the dataset $X_{te}$. Technically, the training dataset $(X_{tr}, Y_{tr})$ is used to construct a classifier $\phi$, which is then used to predict the labels of the individuals in the test sample $X_{te}$. Consequently, the two clustering solutions $A_k(X_{tr})$ and $A_k(X_{te})$ are made comparable by comparing $\phi(X_{te})$ and $A_k(X_{te})$. The stability measure between the two solutions is then computed as the empirical misclassification rate [8]. Lower misclassification rates indicate higher stability.
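The split, transfer and compare steps can be sketched in Python for the K-means case. This is an illustrative sketch rather than the authors' implementation: all function names are hypothetical, K-means is a plain Lloyd's algorithm with a few random restarts, and the misclassification rate is assumed to be minimized over relabelings of the clusters (cluster labels are arbitrary), done here by brute force because k is at most 6.

```python
from itertools import permutations
import numpy as np

def nearest_mean(X, centers):
    """Nearest-means classifier: index of the closest cluster centre."""
    d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)

def kmeans(X, k, rng, n_starts=10, n_iter=100):
    """Lloyd's algorithm with restarts; keeps the lowest within-cluster SS."""
    best = None
    for _ in range(n_starts):
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iter):
            labels = nearest_mean(X, centers)
            new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
            if np.allclose(new, centers):
                break
            centers = new
        labels = nearest_mean(X, centers)
        wss = ((X - centers[labels]) ** 2).sum()
        if best is None or wss < best[0]:
            best = (wss, labels, centers)
    return best[1], best[2]

def misclassification_rate(a, b, k):
    """Smallest disagreement rate between a and any relabeling of b."""
    return min(np.mean(a != np.array(perm)[b]) for perm in permutations(range(k)))

def split_half_instability(X, k, rng):
    idx = rng.permutation(len(X))
    tr, te = idx[: len(X) // 2], idx[len(X) // 2:]
    _, centers_tr = kmeans(X[tr], k, rng)            # Y_tr = A_k(X_tr)
    labels_te, _ = kmeans(X[te], k, rng)             # Y_te = A_k(X_te)
    transferred = nearest_mean(X[te], centers_tr)    # phi(X_te)
    return misclassification_rate(transferred, labels_te, k)

# Synthetic data with three well-separated groups (illustrative only).
rng = np.random.default_rng(42)
blobs = np.vstack([rng.normal(c, 1.0, size=(100, 2))
                   for c in ([0, 0], [10, 0], [0, 10])])
instability = {k: np.mean([split_half_instability(blobs, k, rng) for _ in range(5)])
               for k in (2, 3, 4)}
```

On data like these, the number of clusters with the lowest average misclassification rate would be retained as the most stable one.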
In order to reduce the effect of random splitting, the algorithm was repeated 20 times and the estimate of stability for a given solution was computed as the average of the 20 corresponding estimates. The highest estimate of stability indicates the optimal clustering method and number of clusters. The clustering methods considered were Ward's minimum variance, K-means and K-medians, with the number of clusters k varying from 2 to 6. Since K-means and K-medians may return a local optimum, the algorithms were always run 1000 times with different random starting seeds, and the solution with the minimum total within-cluster sum of squared distances was selected. Concerning the choice of the classifier $\phi$, since we want to measure the stability of clustering solutions, the influence of the classifier should be minimized. For this purpose, Lange et al. suggested choosing a classifier that uses the same grouping principle as the clustering method [8]. Therefore, the K-nearest-means classifier was used when K-means and Ward's method were assessed, whereas the K-nearest-medians classifier was used for the K-medians algorithm. Moreover, as a sensitivity analysis assessing the impact of the stability index used, other measures, namely Cramer's V and the adjusted Rand index (ARI), were also computed. Contrary to the misclassification rate, higher values of Cramer's V and ARI indicate higher stability.
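For reference, the two additional indices can be computed from the contingency table of two label vectors, as in the following Python sketch. The function names are hypothetical and the formulas are the standard textbook definitions, not code taken from the study; both functions assume that each labeling contains at least two clusters.

```python
from math import comb, sqrt
import numpy as np

def contingency(a, b):
    """Cross-tabulate two label vectors of equal length."""
    _, ai = np.unique(a, return_inverse=True)
    _, bi = np.unique(b, return_inverse=True)
    table = np.zeros((ai.max() + 1, bi.max() + 1), dtype=int)
    np.add.at(table, (ai, bi), 1)
    return table

def adjusted_rand_index(a, b):
    """Pair-counting agreement corrected for chance (1 = identical partitions)."""
    t = contingency(a, b)
    n = int(t.sum())
    sum_cells = sum(comb(int(v), 2) for v in t.ravel())
    sum_rows = sum(comb(int(v), 2) for v in t.sum(axis=1))
    sum_cols = sum(comb(int(v), 2) for v in t.sum(axis=0))
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = 0.5 * (sum_rows + sum_cols)
    return (sum_cells - expected) / (max_index - expected)

def cramers_v(a, b):
    """Chi-square based association between two labelings, scaled to [0, 1]."""
    t = contingency(a, b).astype(float)
    n = t.sum()
    expected = np.outer(t.sum(axis=1), t.sum(axis=0)) / n
    chi2 = ((t - expected) ** 2 / expected).sum()
    return sqrt(chi2 / (n * (min(t.shape) - 1)))

# Identical partitions up to relabeling give the maximum value of both indices.
same = ([0, 0, 1, 1, 2, 2], [1, 1, 2, 2, 0, 0])
ari_same, v_same = adjusted_rand_index(*same), cramers_v(*same)
```

Unlike the misclassification rate, neither index requires matching cluster labels between the two solutions. In the design described above, each index would be averaged over the 20 random splits.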

Description of dietary patterns
According to the stability indices, the optimal clustering method for describing dietary patterns in our dataset was K-means with 3 clusters. Clusters were described by the mean of daily food intakes relative to the corresponding overall mean intake. Cluster names were assigned based on the food groups with high consumption. Clusters were also described according to nutrient intake, socio-demographic and lifestyle factors. Continuous variables were presented as mean ± standard deviation (SD). Since most of the variables describing food and nutrient intake were not normally distributed, differences across clusters were evaluated using the Kruskal-Wallis test. Categorical variables were presented as percentages (%) and differences were tested with the Chi-square test. A multinomial logistic regression was run to assess the relationships between clusters (dependent variable) and all socio-demographic and lifestyle characteristics (independent variables). Finally, separate multivariable-adjusted regression models for each CVRF (dependent variable) were used to assess relationships with clusters (independent variable). Models were adjusted for gender, age, educational level, smoking status, level of physical activity and medication use for the corresponding CVRF. Interactions between DP and gender were tested and, if significant, results were stratified by gender. In order to take the sampling design of the study into account, individuals were weighted by the reciprocal of their probability of selection. All analyses were conducted with SAS version 9.4 (SAS Institute Inc., Cary, NC, USA). Ward's method was performed with PROC CLUSTER, and K-means and K-medians with PROC FASTCLUS. P-values < 0.05 were considered significant.
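As an illustration of how clusters can be described relative to the overall mean, the following Python sketch computes weighted cluster means divided by the weighted overall mean. The function name and the toy data are hypothetical, and applying the sampling weights to this descriptive step is an assumption made for the example.

```python
import numpy as np

def relative_cluster_means(intakes, labels, weights):
    """Weighted mean intake per cluster divided by the weighted overall mean.
    A value above 1 marks a food group consumed more than average."""
    overall = np.average(intakes, axis=0, weights=weights)
    clusters = np.unique(labels)
    means = np.array([np.average(intakes[labels == c], axis=0,
                                 weights=weights[labels == c])
                      for c in clusters])
    return clusters, means / overall

# Toy data: 3 individuals, 2 food groups (g/day), weights = 1 / P(selection).
intakes = np.array([[10.0, 2.0], [30.0, 2.0], [20.0, 6.0]])
labels = np.array([0, 0, 1])
weights = np.array([1.0, 1.0, 2.0])
clusters, ratios = relative_cluster_means(intakes, labels, weights)
```

In this toy example the second food group has a ratio of 0.5 in cluster 0 and 1.5 in cluster 1, which is the kind of contrast used to name clusters.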

Comparison with PCA-DP
Continuous dietary patterns were also computed with the PCA method. PCA-DP scores were calculated as the sum of the food intake variables weighted by the loadings generated by the method. Food groups with absolute loading values greater than 0.2 were considered to contribute highly to the pattern [2]. According to the elbow method, three dietary patterns were selected. PCA and cluster analysis were compared by testing differences in mean PCA-DP scores across clusters with the Kruskal-Wallis test.
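A minimal Python sketch of this scoring scheme is given below. It is a sketch under stated assumptions, not the authors' SAS procedure: the function name is hypothetical, PCA is performed on the correlation matrix, and the eigenvector coefficients are used as "loadings" (some software rescales them by the square root of the eigenvalue).

```python
import numpy as np

def pca_patterns(X, n_components=3, threshold=0.2):
    """PCA on the correlation matrix of food group intakes.
    Returns loadings, share of variance explained, individual scores and a
    boolean mask of food groups with |loading| above the threshold."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    corr = np.corrcoef(Z, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)            # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_components]   # largest first
    loadings = eigvecs[:, order]
    explained = eigvals[order] / eigvals.sum()
    scores = Z @ loadings                              # loading-weighted sums
    return loadings, explained, scores, np.abs(loadings) > threshold

# Synthetic intakes: 200 individuals, 6 food groups (illustrative only).
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 6))
X[:, 1] += X[:, 0]                                     # induce some correlation
loadings, explained, scores, high = pca_patterns(X)
```

Mean scores per cluster could then be compared across clusters, which is the comparison described above.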

Results
Choice of clustering method and number of clusters
Figure 1 presents the distribution of the three stability indices across clustering methods and numbers of clusters. Distributions are described with box-plots and average values computed over 20 repetitions of the algorithm. Regardless of the stability index and the number of clusters, more stable solutions were obtained with K-means. In addition, the most stable solution was obtained with 3 clusters. Therefore, dietary patterns were computed with the K-means algorithm and a predefined number of clusters equal to three.

Dietary patterns
The description of each cluster is given in Table 1. Clusters were described by the mean of daily food intakes relative to the corresponding overall mean intake. The cluster labelled "Prudent" was characterized by high intakes of brown bread, fruits, oleaginous fruits, dried fruits, soups, vegetables, pulses, preserved vegetables, offal, fish, smoked and canned fish, shellfish and mussels, dairy products, soya products, olive oil, oils rich in omega 3 or 6, water and tea. In contrast, individuals in this cluster had low intakes of white bread, pastries, rice and pasta, fried foods, lean and fatty meat, processed smoked meat, processed meat, ready meals, minarine and margarine, fresh cream and dressing, sugar and sweets, salty biscuits, soft drinks, diet soft drinks, beer, and aperitifs and spirits.
Individuals in the "Non-Prudent" cluster had low intakes of cereals, rice/pasta, fruits, oleaginous fruits, dried fruits, vegetables, pulses, preserved vegetables, fish, smoked and canned fish, dairy products, soya products, olive oil and oils rich in omega 3 or 6, light fresh cream and dressings, sugar and sweets, water, fruit or vegetable juice and tea. In contrast, the "Non-Prudent" cluster had high intakes of white bread, potatoes, fried foods, lean and fatty meat, offal, processed meat, shellfish and mussels, minarine and margarine, fresh cream and dressings, coffee, diet soft drinks, beer and wine.
Finally, the "Convenient" cluster was characterized by the consumption of convenient fast foods that require little preparation, such as cereals, pastries, rice and pasta, preserved vegetables, smoked and canned fish, ready meals, high-fat dairy products, soya products, fresh cream and dressings, sugar and sweets, salty biscuits, fruit or vegetable juice, soft drinks, and aperitifs and spirits. In contrast, individuals in this cluster had low consumption of brown bread, potatoes, oleaginous fruits, soups, vegetables, pulses, offal, fish, shellfish and mussels, oils rich in omega 3, coffee and wine.
The distribution of dietary patterns is also described in Table 1. The "Convenient" pattern was the most prevalent, with 46% of the population assigned to this cluster. The remaining two clusters were smaller, with 25% and 29% of the population belonging respectively to the "Non-Prudent" and "Prudent" clusters.
The description of dietary patterns according to nutrient intake is presented in Table 2. The "Prudent" cluster was characterized by high intakes of all micronutrients, carbohydrates, total fibre and plant protein. In contrast, this cluster was associated with low intakes of alcohol, animal protein, added sugar and dietary cholesterol. Concerning the fat profile, individuals in this cluster had higher MUFA:SFA (ratio of monounsaturated fat to saturated fat) and PUFA:SFA (ratio of polyunsaturated fat to saturated fat) ratios. Conversely, the "Non-Prudent" cluster had the highest intakes of alcohol, animal protein and dietary cholesterol. It was also characterized by low intakes of carbohydrates, total fibre, added sugar, fat and all micronutrients, and by low MUFA:SFA and PUFA:SFA ratios. The "Convenient" pattern was associated with high intakes of carbohydrates, added sugar and fat, and low intakes of alcohol, total fibre, plant and animal protein, β-carotene, vitamin E and iron.

Association of DP with sociodemographic and lifestyle characteristics
The associations of DP with sociodemographic and lifestyle characteristics are shown in Table 3. The "Non-Prudent" and "Convenient" clusters were compared to the "Prudent" cluster, which was taken as the reference. Older subjects were less likely to adopt a "Convenient" pattern (OR = 0.92 [0.91; 0.93]). Indeed, individuals in the "Convenient" cluster were much younger (36.9 years) than those in the "Prudent" (49.3 years) and "Non-Prudent" (48.9 years) clusters. Men were also more likely to adopt a "Convenient" (OR = 2.2 [1.6; 3.1]) or "Non-Prudent" (OR = 4.2 [2.9; 5.9]) pattern rather than a "Prudent" one. Likewise, individuals with less education were more likely to adopt a "Non-Prudent" pattern. Concerning the region, compared to Lorraine, individuals living in Luxembourg were more likely to adopt a "Convenient" (OR = 1.7 [1.2; 2.4]) or a "Non-Prudent" (OR = 2.1 [1.3; 3.4]) pattern. The difference was even larger for individuals living in Wallonia, who showed a clear preference for the "Non-Prudent" (OR = 7.1 [4.5; 11.4]) and "Convenient" (OR = 2.7 [1.8; 3.9]) patterns. In detail, 41% of individuals living in Lorraine adopted a "Prudent" pattern, compared with only 28.7% in Luxembourg and 19.1% in Wallonia. Conversely, only 14% of individuals in Lorraine adopted a "Non-Prudent" pattern, compared with 19.9% in Luxembourg and 36.7% in Wallonia. Concerning lifestyle factors, smokers were more likely to adopt a "Non-Prudent" pattern (OR = 3 [1.9; 4.7]). Regarding physical activity, individuals in the "Convenient" cluster engaged in significantly less physical activity (OR = 0.993 [0.988; 0.998]).
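Odds ratios for continuous predictors are expressed per unit of the predictor, which explains why values close to 1 (such as 0.92 for age) can still correspond to large differences. The short Python sketch below illustrates this under the assumption, not stated explicitly in the text, that age was entered in years with a linear effect on the log-odds; the function name is hypothetical.

```python
def odds_ratio_over(or_per_unit, units):
    """Odds ratio implied for a difference of `units`, assuming a log-linear effect."""
    return or_per_unit ** units

# Assumption: OR = 0.92 is per additional year of age.
or_10_years = odds_ratio_over(0.92, 10)
print(round(or_10_years, 2))  # 0.43
```

Under this assumption, a ten-year age difference would correspond to less than half the odds of belonging to the "Convenient" rather than the "Prudent" cluster.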

Association of DP with CVRF
Multivariate-adjusted β-coefficients for CVRF according to DP are displayed in Table 4. Compared to the "Prudent" pattern, higher BMI was observed in individuals who adopted the "Convenient" and the "Non-Prudent" patterns, whereas higher WHR was only observed in men who adopted the "Non-Prudent" pattern. The "Non-Prudent" and "Convenient" patterns also showed higher SBP and DBP values. Concerning diabetes, the "Convenient" and especially the "Non-Prudent" pattern were significantly associated with higher FPG but not HbA1c. Regarding cholesterol levels, the "Non-Prudent" cluster was associated with higher LDL and HDL in men only. Further adjustment for treatment did not change the results.

Comparison of dietary patterns obtained with PCA and K-means
Continuous dietary patterns were computed using the PCA method. According to the scree-plot, three dietary patterns were selected. The percentage of variance explained and the loadings of food groups on DP are presented in Table 5. The three PCA-patterns accounted for 7.1% (3.1%, 2.1% and 1.9% respectively) of the total variance in food intakes. The first pattern was labelled "Prudent" as it was characterized by high intakes of fruits, oleaginous and dried fruits, soups, vegetables, pulses, fish, low-fat dairy products, soya products, olive oil, oils rich in omega 6, water and tea, and low intakes of fried foods, lean and fatty meat, processed meat, ready meals, minarine and margarine, fresh cream and dressing, salty biscuits, soft drinks, diet soft drinks and beer. The second PCA-pattern was named "Animal protein and alcohol" since it was positively associated with vegetables, pulses, all kinds of meat and fish, and alcoholic beverages, and negatively associated with sugar and sweets, high-fat dairy products, pastries and cereals. The third pattern was labelled "Convenient" since it was positively correlated with convenient foods that require little preparation, such as brown bread, cereals, rice, pasta, smoked and canned fish, shellfish and mussels, ready meals, low-fat dairy products, soya products, fresh cream and dressings, salty biscuits, and fruit or vegetable juice. It was also negatively correlated with white bread, potatoes and butter. The comparison of dietary patterns obtained through PCA and K-means is shown in Fig. 2. The three clusters were similar to the three continuous dietary patterns obtained through PCA. Indeed, the PCA-Prudent DP was highest in the "Prudent" cluster, the PCA-animal protein and alcohol DP was highest in the "Non-Prudent" cluster and the PCA-convenient DP was highest in the "Convenient" cluster.

Discussion
The main objective of this study was to test a method allowing the objective selection of the most appropriate model among different clustering methods and numbers of clusters in the field of dietary pattern analysis. The idea was to assess the stability of different clustering solutions and to choose the most stable solution as the most appropriate for describing the data. According to this method, three dietary patterns were identified with the K-means algorithm. The "Non-Prudent" and "Convenient" patterns, associated respectively with non-healthy food choices and convenient foods, were both associated with a higher cardiovascular risk compared to the "Prudent" cluster, which was characterized by healthier dietary habits and lower cardiovascular risk. Among the clustering methods considered in this article, K-means clearly showed more stable solutions regardless of the number of clusters. However, it is highly likely that other, more sophisticated methods would have been more appropriate [3]. Indeed, the clustering methods considered in this study were very simple, and other methods with greater flexibility regarding cluster characteristics are more likely to identify real, complex structure. In addition, although K-means was found to be the most appropriate method for describing dietary patterns in adults living in the Greater Region, this may not be the case with other datasets, since group structures in other populations are likely to differ. Therefore, this should be explored in additional datasets across different populations.
In the field of dietary pattern analysis, we are aware of only two studies comparing different clustering methods. Like us, Lo Siou et al. assessed the stability of solutions obtained with different clustering methods and numbers of clusters and also showed that K-means was the most appropriate method [3]. Contrary to our results, stability decreased with the number of clusters, and they were therefore not able to identify an optimal number of clusters with this method. In addition, as proposed by Lange et al. [8], the authors also used a classifier to transfer the solution obtained on one sample to another. However, the classifier should use the same grouping principle as the clustering method. In accordance with Lange et al., we used the nearest-means classifier for K-means and Ward's method and the nearest-medians classifier for K-medians, whereas Lo Siou et al. used the nearest-neighbour classifier for K-means and Ward's method. In order to assess the effect of using a non-optimal classifier, we compared stability indices computed on our data with the optimal classifiers and with the non-optimal nearest-neighbour classifier. All indices indicated lower stability when the non-optimal nearest-neighbour classifier was used (see Additional file 2: Figure S2). Therefore, the stability values reported by Lo Siou et al. may have been underestimated.
In another study, Greve et al. used an inappropriate approach for choosing the optimal number of clusters [19]. They selected as optimal the number of clusters maximizing the agreement between different clustering methods. However, agreement between methods depends on the capability of each method to identify the cluster structure. Indeed, if a method is not able to distinguish the clusters, it will never agree with another method, even for the correct number of clusters. Therefore, although it is reassuring to have good agreement between solutions obtained with different algorithms, agreement should not be used for choosing the optimal number of clusters.

Comparison with other studies
In accordance with the literature [2,20], we derived a "Prudent" dietary pattern characterized by plenty of plant foods and fish and a preference for vegetable oils and low-fat dairy products. In contrast, and similar to the Western DP described in other studies [2,20], we also derived a "Non-Prudent" pattern characterized by intakes of red and processed meats, high-fat foods, refined grains, soft drinks and alcoholic beverages [21,22]. However, contrary to most Western DPs described in the literature [2,20], our "Non-Prudent" pattern was not associated with intakes of sweets and sugar.
Similar to some studies [4,23,24], we also found a cluster characterized by the consumption of convenient fast foods. It showed high intakes of convenient unhealthy foods like pastries, ready meals, high-fat dairy products, fresh cream and dressings, sugar and sweets, salty biscuits, soft drinks, and aperitifs and spirits. However, it was also characterized by high intakes of convenient healthy foods like cereals, preserved vegetables, smoked and canned fish, soya products, and fruit or vegetable juice. Regarding nutrients, this pattern was associated with high intakes of carbohydrates, added sugar and fat. The "Convenient" pattern was the most prevalent, with 46% of the population assigned to this cluster. The "Prudent" and "Non-Prudent" patterns were adopted by 29% and 25% of the population respectively. However, striking differences were noticed across regions. Although the "Convenient" pattern was the most adopted in all regions, the "Prudent" pattern was more frequent in Lorraine (41%) than in Luxembourg (28.7%) and Wallonia (19.1%). In sum, only a small part of the population has healthy dietary habits, and this part is even smaller in Luxembourg and Wallonia. The adoption of a "Convenient" pattern may be due to the fact that people have less and less time for preparing and cooking foods and thus choose to consume prepared foods.
In line with other studies, we also found significant associations between dietary patterns and sociodemographic and lifestyle characteristics. We found that the "Convenient" pattern was more likely to be adopted by men and younger people [23]. Since the Luxembourg population includes more young, active working people, this might explain the larger size of the "Convenient" cluster in Luxembourg compared to Wallonia and Lorraine [25]. In addition, as also shown by other studies [4], women and individuals with higher education were more likely to adopt a "Prudent" pattern. Moreover, in accordance with other studies [2,4], we found that people who choose unhealthy dietary habits are less likely to engage in healthy behaviours such as being physically active and not smoking. This shows that the choice of a dietary pattern is in fact part of a larger lifestyle pattern.
Concerning associations with CVRF, we found that the "Convenient" and "Non-Prudent" patterns were associated with higher BMI, WHR, SBP, DBP and FPG [4,23,26]. Moreover, the "Non-Prudent" pattern was also associated with higher HDL and LDL levels in men only. This is in accordance with other studies, which also found that a cluster dominated by alcohol was directly associated with HDL [27-29]. The fact that the association was significant in men only might be explained by different levels of alcohol consumption between men and women. Indeed, when clusters were described by gender, we observed that the "Non-Prudent" cluster was characterized by high intakes of alcohol in men but not in women (data not shown). Another explanation could be a different effect of diet on plasma lipids between men and women, possibly due to hormonal and sex differences in [2,30,31]. Moreover, genetic variation in lipoprotein metabolism may also have an effect [32].

Comparison between PCA and cluster analysis
Despite clear differences in approach and interpretation, PCA and cluster analysis gave similar results. A "Prudent" DP was identified with both methods. Indeed, a "Prudent" and a "Non-Prudent" cluster, with respectively high and low values on the PCA-Prudent DP, were found. Likewise, the "Convenient" cluster was made up of individuals with high values on the PCA-convenient DP. Concerning the PCA-Animal protein and alcohol pattern, we did not observe a cluster of individuals with only high intakes of meat, fish and alcohol. However, since this DP is characterized by high intakes of foods (meat and alcohol) usually consumed in a "Non-Prudent" pattern, it was significantly higher in the "Non-Prudent" cluster. Those results are in line with other studies, which also found differences in mean PCA-DP across clusters [33-35].
Although the results of the two methods were similar, they describe diet in different ways. PCA aims to determine DP explaining the variation in a set of food groups, whereas cluster analysis aims to identify groups of people with different food intakes. Moreover, the format of the DP is also different. In cluster analysis, an individual's dietary pattern is described through his/her membership of a group, whereas in PCA the subject is described by his/her scores on all computed DP. Therefore, the choice of a method depends both on the desired format of the outcome and on the hypotheses and aims of the study. An advantage of PCA is that it may be easier to perform, as it requires fewer subjective decisions from the researcher. However, findings from cluster analysis are easier to interpret because an individual is assigned to one cluster only, whereas PCA-DP do not refer to identifiable groups within the population and hence do not give an indication of the prevalence of a particular type of diet [35]. On the other hand, continuous factors determined by PCA may be advantageous when relationships between DP and other variables are assessed, since a gradient is formed between individuals with low, medium or high values on the factors. Moreover, they do not require the use of a reference category [26]. As other authors have suggested, unless the choice of one method is justified, it is advisable to use both factor and cluster analysis in order to gain complementary insights [36].

Strengths and limitations
The main strength of this study was the use of an objective procedure to select the most appropriate clustering method and number of clusters. Compared to other internal validity indices, the stability measure has the advantage of being model-free and of not being optimized by any clustering method. Moreover, the clustering solution was also compared with PCA-derived factors. Further, this study used a recent and homogeneous design of data collection including three large randomly selected samples from three neighbouring regions. A shortcoming of this study was that the clustering methods considered were all heuristic-based and make basic assumptions about group structure. The reason is that, since the main objective of this study was to test the objective procedure, we decided to limit its application to the traditional clustering methods used in the field of dietary pattern analysis [2]. We will therefore also consider more sophisticated methods in the future. In addition, although the method allows stable and spurious clustering solutions to be distinguished, stability is not the only aspect of a good solution. Indeed, a stable clustering solution may still be meaningless if it does not discriminate useful subsets of the overall data [37]. However, unstable solutions should not be interpreted, and thus stability is an indispensable requirement [37]. For this reason, the interpretation and critical appraisal of the clustering solution by the researcher, and its comparison with results obtained with PCA, remain important. In addition, many other subjective decisions that are likely to influence the final solution still had to be made, namely the pooling of different food items into specific food groups, the quantification of the input variables, the adjustment for total energy intake and the method of standardization. However, the robustness of the chosen solution and the consistency of the results with PCA-DP gave confidence in our results.
Other limitations are the cross-sectional design of the study and the probable measurement error linked with the FFQ. Finally, although we identified dietary patterns associated with disease risk, we still do not know whether this effect comes from certain components only or is the product of the addition or interaction of several food groups.

Conclusion
In summary, we used an objective methodology based on the stability of clustering solutions, which allows selection of the most appropriate clustering method and number of clusters for describing dietary patterns in a population. Three main dietary patterns were identified in the Greater Region: a "Convenient" and a "Non-Prudent" pattern associated with a higher cardiovascular risk, and a "Prudent" pattern associated with a decreased cardiovascular risk. Those results flag the need for targeted public health initiatives, at the interregional level, promoting the benefits of a prudent dietary pattern and other healthy behaviours to relevant subgroups such as men, young people and less educated people.