Importance of details in food descriptions in estimating population nutrient intake distributions

Background National food consumption surveys are important policy instruments for monitoring the food consumption of a population. To serve multiple purposes, this type of survey usually collects comprehensive food information using dietary assessment methods such as 24-h dietary recalls (24HRs). However, collecting and handling such detailed information requires tremendous effort. We aimed to improve the efficiency of data collection and handling in 24HRs by identifying less important characteristics of food descriptions (facets) and assessing the impact of disregarding them on energy and nutrient intake distributions.
Methods In the Dutch National Food Consumption Survey 2007–2010, food consumption data were collected from 3819 persons through interviewer-administered 24HRs using GloboDiet software. Interviewers asked participants about the characteristics of each food item according to the applicable facets. Food consumption data were subsequently linked to the food composition database. The importance of facets for predicting energy and each of the 33 nutrients was estimated using the random forest algorithm. A simulation study was then performed to determine the influence of deleting less important facets on population nutrient intake distributions.
Results We identified 35% of the facets as unimportant and deleted them from the total food consumption database. The majority (79.4%) of the percent differences between percentile estimates of the population nutrient intake distributions before and after facet deletion ranged from 0 to 1%, while 20% of cases ranged from 1 to 5% and 0.6% of cases exceeded 10%.
Conclusion We concluded that our procedure was successful in identifying less important food descriptions in estimating population nutrient intake distributions. The reduction in food descriptions has the potential to reduce the time needed for conducting interviews and data handling while maintaining the data quality of the survey.
Electronic supplementary material The online version of this article (10.1186/s12937-019-0443-5) contains supplementary material, which is available to authorized users.


Background
National food consumption surveys are essential policy instruments and have been carried out successively in many countries [1,2]. They serve many purposes, such as identifying nutrient inadequacies at the population level, assessing the risk of hazardous substances, and developing dietary guidelines [1,3].
The 24-h dietary recall (24HR) has been frequently used as the primary dietary assessment method for collecting national food consumption data [4,5]. As an open-ended and retrospective method, 24HR is less likely to alter diet behaviour and has a lower literacy requirement for the participants than food records [6,7]. Traditionally, interviewers collect information about the foods consumed during the preceding day or the previous 24 h by triggering the participant's memory using different cues to increase the completeness of the survey [8]. This method collects sufficient food consumption data but has a long interview duration and a rather complicated data handling procedure [9,10].
With the advent of computers, several comprehensive dietary assessment protocols have been incorporated into computer-assisted 24HR interview software used in large-scale studies [5,11,12]. These protocols standardize the dietary data collection procedure and help the respondents recall their food intake to the maximum extent [13]. Examples include the Automated Multiple-Pass Method (AMPM), developed by the U.S. Department of Agriculture (USDA) to conduct the dietary interview for the National Health and Nutrition Examination Survey [14]. In Europe, the International Agency for Research on Cancer (IARC) has developed the menu-driven 24HR software GloboDiet (previously known as EPIC-Soft), which was validated to be used in food consumption surveys in European countries [15,16].
In the multiple-pass protocol of GloboDiet, the most time-consuming step is the collection of detailed information on each consumed food (i.e., food description). Details of each food item are collected through prompt windows for facets, which represent various characteristics of food, such as fat content, cooking method, and brand name. The predefined answers to the facet questions are called descriptors, such as full fat, semi-skimmed, and skimmed [17]. The use of facets and descriptors standardizes the interview among different interviewers and characterizes the consumed foods in aspects relevant to the study purposes, such as the content of nutrients and potentially hazardous chemicals [18][19][20].
Although applying a large number of facets and descriptors provides a high level of detail, the duration and cost of the survey rise accordingly [7]. Specifically, the interviewers ask more questions during the interview, and the dieticians have to link all new food-descriptor combinations to the food composition database manually after the interview [10,21,22]. Furthermore, some food characteristics that require reading food labels (e.g., fortification) or knowledge about the preparation of the food (e.g., type of fat used) are difficult for many participants to report [1,23]. Therefore, an investigation is needed into whether a reduction in food characteristics could improve cost-effectiveness while maintaining the data quality of the survey.
The current study aims to evaluate facet importance in predicting nutrient contents of foods, the impact on population nutrient intake distributions and the time saved after deleting less important facets from the data collection procedure.

Data collection
In the Netherlands, the Dutch National Food Consumption Survey (DNFCS) monitors the food consumption of the general Dutch population. The data used in this study came from the DNFCS performed from 2007 to 2010 on the diet of children and adults aged 7 to 69 years. Study design, recruitment, and results have been described elsewhere [24]. Subjects were excluded if they were pregnant, lactating, institutionalized or did not speak adequate Dutch. In total, 3819 participants (69%) were qualified and responded to the survey.
Dietary intake of participants was collected through two 24HRs on non-consecutive days with 2-6 weeks in between. Trained dieticians conducted the 24HRs for 2522 persons aged 16 and older through telephone interviews. The 24HRs for 1297 children aged 7 to 15 years were collected through face-to-face interviews during home visits, in the presence of their caretakers. All interviews were conducted following the same data collection and handling protocol.
During both face-to-face and telephone 24HR interviews, dieticians used the multi-step computer-based interview software GloboDiet to guide the interview and to enter the data in the computer. The average time needed to complete one face-to-face 24HR interview and one telephone interview was 41 min and 46 min, respectively. The GloboDiet interview consists of the following five steps: 1. Collection of the general information, 2. Listing of foods and recipes consumed throughout the day, 3. Specification of details of foods by choosing descriptors of relevant facets and consumed amounts, 4. Quality check of inaccurate input, and 5. Dietary supplement intake [15]. The collection of details in step three took about 15 min. IARC provided common facets and descriptors for countries that used GloboDiet as their data collection software. The actual selection of facets and descriptors could be adjusted according to country- or study-specific situations. A total of 16 facets with varying numbers of descriptors were selected by experienced dieticians for inclusion in the GloboDiet version customized for DNFCS 2007-2010, based on the Dutch food market and the purposes of the data collected (Table 1).

Data handling
The consumption data collected from all participants over the two 24HRs comprised 219,006 food records with 350,369 descriptors in total, ranging from 0 to a maximum of 8 descriptors per record. This yielded 26,679 unique combinations of foods with descriptors. Trained dieticians linked all combinations to the 1599 most appropriate food codes in the Dutch National Food Composition Database (NEVO Table 2011/3.0), which contains the energy, macro- and micronutrient contents of 2389 food codes in total [25].

Statistical analysis
To assess the importance of the GloboDiet facets in predicting the nutrient contents of consumed foods in DNFCS, we used random forest [26]. Random forest is a prediction model that consists of a multitude of decision trees. Each tree is trained on a different subset of the training data, and the remaining data (not used for training that tree) are used to estimate prediction error and variable importance. In our study, foods consumed by all participants in both 24HRs were used for estimating facet importance. The number of randomly selected variables considered when splitting the tree at each node was set to its default value (mtry = total number of predictor variables/3), and the number of trees for each nutrient was set at 10,000. Stratified by food group, the importance of a facet (denoted by %IncMSE) was calculated as the percentage increase in prediction error when data for that facet were permuted in the dataset while data for the other facets were kept unchanged. The random forest algorithm was applied through the randomForest package in R, using RStudio 1.1.383.
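The permutation step behind %IncMSE can be illustrated with a minimal sketch. It is not the randomForest implementation: a simple group-mean lookup stands in for the forest, the error is computed on the training rows rather than on out-of-bag data, and the facet values and energy contents are hypothetical toy data. Only the idea of permuting one facet while leaving the others unchanged is taken from the text.

```python
import random
from collections import defaultdict

def fit_group_means(rows, targets):
    # Stand-in predictor (assumption): mean target per descriptor combination.
    sums = defaultdict(lambda: [0.0, 0])
    for row, y in zip(rows, targets):
        cell = sums[tuple(sorted(row.items()))]
        cell[0] += y
        cell[1] += 1
    return {key: total / n for key, (total, n) in sums.items()}

def mse(model, rows, targets):
    errors = [(y - model[tuple(sorted(row.items()))]) ** 2
              for row, y in zip(rows, targets)]
    return sum(errors) / len(errors)

def permutation_importance(model, rows, targets, facet, rng):
    # Percentage increase in MSE after shuffling one facet column only.
    baseline = mse(model, rows, targets)
    shuffled = [row[facet] for row in rows]
    rng.shuffle(shuffled)
    permuted = [{**row, facet: value} for row, value in zip(rows, shuffled)]
    return 100.0 * (mse(model, permuted, targets) - baseline) / baseline

# Hypothetical toy data: energy depends on fat content, not on the brand flag.
rows = [{"Fat content": "full fat" if i % 2 == 0 else "skimmed",
         "Brand name (yes/no)": "yes" if (i // 2) % 2 == 0 else "no"}
        for i in range(40)]
targets = [(60 if row["Fat content"] == "full fat" else 35) + (i * 7) % 5 - 2
           for i, row in enumerate(rows)]

model = fit_group_means(rows, targets)
rng = random.Random(0)
importance = {facet: permutation_importance(model, rows, targets, facet, rng)
              for facet in ("Fat content", "Brand name (yes/no)")}
print(importance)
```

In this toy setting, shuffling the informative facet raises the error while shuffling the uninformative one leaves it unchanged, which is the contrast the importance measure is meant to capture.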
The 24HR variables of 16 facets, food IDs (a series of numbers identifying food items) and food subgroups (elements of main food groups) were regarded as predictor variables. The detailed food group information can be found in Additional file 1. Energy and 33 macro- and micronutrients were regarded as response variables and were predicted one by one with the predictor variables. Food IDs were treated as a continuous variable because the number of distinct food IDs exceeds the limit of 32 levels allowed for categorical variables in the implementation of random forest. As comparable foods were numbered sequentially, treating food ID as continuous is reasonable. Facets were treated as categorical variables. The facet "Flavoured/added components" was separated into three sub-groups based on the category (nuts, sugary, savoury) of its descriptors, since its number of descriptors also exceeded the 32 levels allowed for categorical variables. The variable brand name was not included as a predictor, as it consists of a free text field, yielding many unordered categories that were difficult to separate into sub-groups. Instead, we included the facet "Brand name (yes/no)", which indicated whether the brand name field was filled in or not.
To facilitate the comparison of the relative importance of facets between nutrients, %IncMSEs were normalized within each food group and each nutrient by dividing them by the highest %IncMSE over the facets. For each food group, the maximum normalized %IncMSE of each facet across all nutrients was retained. After deleting facets with a maximum normalized %IncMSE lower than 0.80 in each food group, only small effects on population nutrient intake distributions were observed; therefore, a cut-off point of 1.00 was chosen to obtain more pronounced results. Hence, in each food group, facets with a normalized value below 1.00 for all nutrients were considered unimportant.
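The normalization and cut-off rule can be sketched as follows. The input layout, the function names and the example values are hypothetical; clamping negative %IncMSE values to zero is an additional assumption that the text does not address.

```python
from collections import defaultdict

def max_normalized_importance(inc_mse):
    """inc_mse maps (food_group, nutrient) -> {facet: %IncMSE}.

    Returns {food_group: {facet: maximum normalized %IncMSE over nutrients}}.
    """
    result = defaultdict(dict)
    for (group, _nutrient), by_facet in inc_mse.items():
        top = max(by_facet.values())
        for facet, value in by_facet.items():
            # Assumption: negative importances count as zero.
            normalized = max(value, 0.0) / top if top > 0 else 0.0
            previous = result[group].get(facet, 0.0)
            result[group][facet] = max(previous, normalized)
    return dict(result)

def unimportant_facets(max_norm, cutoff=1.00):
    # A facet is unimportant when it stays below the cut-off for all nutrients.
    return {group: sorted(f for f, v in facets.items() if v < cutoff)
            for group, facets in max_norm.items()}

# Hypothetical %IncMSE values for one food group and two nutrients.
inc_mse = {
    ("Dairy", "energy"): {"Fat content": 40.0, "Brand name (yes/no)": 10.0, "Source": 2.0},
    ("Dairy", "calcium"): {"Fat content": 30.0, "Brand name (yes/no)": 27.0, "Source": 1.0},
}
max_norm = max_normalized_importance(inc_mse)
print(unimportant_facets(max_norm, cutoff=1.00))
print(unimportant_facets(max_norm, cutoff=0.80))
```

With a cut-off of 1.00, only facets that are the single best predictor for at least one nutrient are retained, which is why this cut-off removes more facets than 0.80.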

Simulation study
We conducted a simulation study to investigate whether deleting unimportant facets affected the population nutrient intake. We summarized the average nutrient intake over the two 24HRs for each participant and calculated the population nutrient intake distributions in both the original and simulation scenarios. The first step was to create the simulation datasets. After deleting one or more unimportant facets, we linked new unique food-descriptor combinations to the national food composition database NEVO semi-automatically. As illustrated in Fig. 1, for each new combination, a NEVO code was assigned based on the NEVO codes that had been linked by dieticians during the survey period to the same food with the most similar descriptor combinations. To identify the most similar descriptor combination, we gave combinations a positive score for each identical pair of descriptors (equal to the maximum normalized %IncMSE) and a penalty for descriptors that were different (equal to the negative maximum normalized %IncMSE). The scores were summed, and the NEVO code of the food-descriptor combination with the highest score was assigned to the combination that needed to be relinked. If more than one NEVO code had the same highest score, or if no descriptors were left for a food item after deleting unimportant facets, the NEVO code of the food-descriptor combination with the higher consumed quantity was selected. If the consumed quantities were also the same (which occurred in 38 cases), a researcher decided on the NEVO code selection.
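The scoring and tie-breaking rules above can be sketched as below. This is an illustration under stated assumptions, not the authors' code: the record layout, function names, weights and NEVO codes are hypothetical, and a facet present in only one of the two combinations is treated as "different", a case the text does not specify.

```python
def similarity_score(target, candidate, weights):
    """target, candidate: {facet: descriptor}; weights: {facet: max normalized %IncMSE}."""
    score = 0.0
    for facet in sorted(set(target) | set(candidate)):
        weight = weights.get(facet, 0.0)
        same = (facet in target and facet in candidate
                and target[facet] == candidate[facet])
        score += weight if same else -weight
    return round(score, 9)

def reassign_code(target, candidates, weights):
    """Return the chosen NEVO code, or None when a researcher must decide."""
    ranked = []
    for cand in candidates:
        # With no descriptors left, only the consumed quantity decides.
        score = similarity_score(target, cand["descriptors"], weights) if target else 0.0
        ranked.append((score, cand["quantity"], cand["nevo_code"]))
    if not ranked:
        return None
    best = max((score, qty) for score, qty, _ in ranked)
    codes = {code for score, qty, code in ranked if (score, qty) == best}
    return codes.pop() if len(codes) == 1 else None

# Hypothetical weights, descriptor combinations and NEVO codes.
weights = {"Fat content": 1.0, "Cooking method": 0.5}
candidates = [
    {"descriptors": {"Fat content": "semi-skimmed", "Cooking method": "boiled"},
     "nevo_code": 101, "quantity": 200},
    {"descriptors": {"Fat content": "full fat", "Cooking method": "boiled"},
     "nevo_code": 102, "quantity": 900},
]
target = {"Fat content": "semi-skimmed", "Cooking method": "boiled"}
print(reassign_code(target, candidates, weights))
print(reassign_code({}, candidates, weights))
```

Returning `None` for unresolved ties mirrors the manual decision by a researcher described in the text.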
To summarize the population nutrient intake distributions in both the original dataset and the simulation dataset, the energy and nutrient contents per 100 g of food in NEVO were multiplied by the quantities consumed in DNFCS 2007-2010 and averaged over the two days for each participant. All results were weighted for small deviations in sociodemographic characteristics (age, sex, region, degree of urbanisation and educational level), the day of the week and the season of data collection, to give results that are representative of the Dutch population and of all days of the week and all seasons. The mean, the median, the 5th, 25th, 75th and 95th percentiles, and the percent differences of consumption per nutrient between the original and simulation datasets were calculated for the total population and stratified by gender and age group (7-18 years old and 19-69 years old). The population nutrient intake distributions were calculated using SAS 9.4, and the percent differences between the original and simulation datasets were calculated using Excel 2016.
Fig. 1 Flow chart of the NEVO code reassignment protocol. A NEVO code was assigned to each relinking combination according to the NEVO codes of the same food with the most similar descriptor combinations that had been linked by dieticians during the survey period. The combinations received a positive score for each identical pair of descriptors (equal to the maximum normalized %IncMSE) and a penalty for descriptors that were different (equal to the negative maximum normalized %IncMSE). The scores were summed, and the NEVO code of the food-descriptor combination with the highest score was assigned to the combination that needed to be relinked. If more than one NEVO code had the same highest score, or if no descriptors were left for a food item, the NEVO code of the food-descriptor combination with the higher consumed quantity was selected. If the consumed quantities were also the same (which occurred in 38 cases), a researcher decided on the NEVO code selection.

Results
Table 2 shows the normalized maximum importance (%IncMSEs) of 16 facets in predicting the nutrient contents of food items within each of 17 food groups. Using a cut-off point of 1.00, we identified a total of 64 out of 112 facets across food groups as unimportant, whereas a total of 50 facets fell below the cut-off point of 0.80. For a cut-off point of 0.80, 22% of the 350,369 facet descriptors were deleted from the total food consumption database. The majority of the percent differences between percentile estimates of the population nutrient intake distributions before and after facet deletion ranged from 0 to 1%, while only 2% of cases ranged from 1 to 5% (Additional file 2). According to Table 2, for a cut-off point of 1.00, no facets were unimportant in the food groups 'Fats and oils' and 'Alcoholic beverages', whereas all facets were unimportant for 'Cakes and sweet biscuits'. The food group 'Miscellaneous' had more unimportant facets than any other food group. In the 'Meat' group, most facets had zero effect in predicting nutrient contents, including 'Source', 'Packing medium', 'Fat content', 'Brand name (yes/no)', 'Skin consumed', and 'Visible fat consumed'.
Table 2 The maximum normalized %IncMSEs of the existing facets in each food group
From the facet perspective, 'Brand name (yes/no)' and 'Packing medium' were unimportant for the largest number of food groups (10 and 7 food groups, respectively). The rest of the facets were deleted in 1 to 5 food groups. 'Source' and 'Visible fat consumed' were unimportant for all the food groups for which they are relevant (3 and 1 food groups, respectively). On the other hand, 'Physical state' and 'Cooking method' were strong predictors (importance of 1.00) for the largest number of food groups. The facet 'Type of packing' was only available for the food group 'Fats and oils' and was a strong predictor for that food group. Although 'Brand name (yes/no)' was unimportant for most of the food groups, it was a strong predictor for the food groups 'Cereals', 'Fats and oils', 'Alcoholic beverages' and 'Non-alcoholic beverages'. Full results of the facet importance for each nutrient in each food group can be found in Additional file 3.
In the original total food consumption database, 35% (121,015 out of 350,369) of the total descriptors used were identified as unimportant, which resulted in a NEVO code change for 11% (2923 out of 26,679) of the combinations in the unique food dataset and 3.7% (8196 out of 219,006) of the combinations in the total food consumption dataset.
After reassigning the NEVO codes, the population means and percentiles of two days' average energy and nutrient intakes in DNFCS 2007-2010 were calculated, as well as the percent differences between the original and the simulation results. Table 3 shows the results for energy and ten nutrients that are commonly found on the nutrition facts label. The results for all nutrients can be found in Additional file 4. The majority (79.4%) of the percent differences between distribution percentiles before and after facet deletion ranged from 0 to 1%, while 20% of cases ranged from 1 to 5% and 0.6% of cases exceeded 10%. Percent differences larger than 1% were mainly found in vitamins. Differences of more than 10% appeared mostly in vitamins for 7-18-year-olds and in the extreme percentiles P5 and P95. Some of the differences that were larger than 10% were small in absolute terms. For example, the largest difference of 14.1% was for the P95 of vitamin B6, but the absolute difference between the two scenarios was 0.5 mg (rounded to mg). No general patterns of nutrient over- or underestimation after facet deletion were found for most nutrients. However, lower vitamin C contents were found in each percentile after facet deletion for all age groups, whereas higher amounts of B vitamins were found after facet deletion.
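The comparison of percentile estimates can be sketched as a percent difference followed by a tally per band. The sign convention (relative to the original estimate), the band labels, the extra 5-10% band and the example values are assumptions for illustration; they are not taken from the survey data.

```python
from collections import Counter

def percent_difference(original, simulated):
    # Assumption: the difference is expressed relative to the original estimate.
    return 100.0 * (simulated - original) / original

def band(pct):
    size = abs(pct)
    if size <= 1:
        return "0-1%"
    if size <= 5:
        return "1-5%"
    if size <= 10:
        return "5-10%"
    return ">10%"

# Hypothetical (original, simulated) percentile estimates for a few nutrients.
estimates = {
    ("energy", "P50"): (2000.0, 2004.0),
    ("vitamin C", "P5"): (50.0, 43.0),
    ("calcium", "P95"): (1000.0, 1030.0),
}
tally = Counter(band(percent_difference(o, s)) for o, s in estimates.values())
print(tally)
```

Tallying by band is what makes a large relative change on a small absolute value visible, as in the vitamin B6 example above.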

Discussion
To enhance the efficiency of data collection and handling of GloboDiet 24HRs, we explored the option of deleting less important food characteristics (facets) from the interview. The importance of each facet in predicting nutrient contents in foods was determined by the random forest algorithm. When the 35% least predictive facets were deleted from the dataset of the Dutch National Food Consumption Survey 2007-2010, the difference between the original and simulated population nutrient intake distributions was small for the majority of the nutrients.
There are several possible explanations for why certain facets were less or more predictive in certain food groups. One reason for less predictive facets is that some facets were only applicable to a few food items in certain food groups, and those food items were rarely consumed. An example of this is the facet 'Enriched/fortified' in the food group 'Cakes and sweet biscuits'. A second reason is a lack of variation in the chosen descriptors within a facet. An example of this is the facet 'Source' in dairy products, since cow milk is the basis for the majority of the dairy products consumed in the Netherlands. Another possible explanation for the less predictive facets is the use of the generic food composition database NEVO [27]. Some facets might have been important for predicting true nutrient levels but not for the averaged nutrient levels of generic foods. For example, the facet 'Brand name (yes/no)', which could typically be a good predictor of nutrient levels in industrially processed foods [28], showed low predictive power for most of the food groups in this study.
In contrast, some facets showed strong predictive power in estimating nutrient contents in certain food groups. The facet 'Type of packing' predicted strongly for the 'Fats and oils' group, because the type of packing materials could distinguish solid from liquid fat. Hence, the variance in the fat content between solid and liquid fat could be differentiated by the facet 'Type of packing'. Similarly, as can be expected from a nutrition point of view, the facets 'Physical state', 'Sugar' and 'Fat content' were strong predictors for most of their allocated food groups, except for unprocessed products (e.g., fruit, meat, and fish).
Table 3 The population means and percentiles of two days' average energy and ten nutrients' intake distributions before and after facets' deletion at cut-off at 1.00
When nutrient intake distributions before and after facet deletion were compared, a difference of less than 10% was found for most nutrients. A similar finding was observed in a study that investigated the effect of a concise versus an extensive food list in a self-administered web-based 24HR tool. That study found that the differences between population nutrient intakes assessed by the two methods were less than 6% [29], which is consistent with our finding that the majority of the differences before and after facet deletion fell below 5%. In this study, the small difference could be explained by the fact that 96.3% of the combinations were relinked to the same food code in the food composition database. From this, we speculate that the food names and remaining facets provide sufficient information for NEVO linkage. For those combinations with deleted facets that were linked to different food codes in the food composition database, the difference in nutrient contents between the original and alternative food codes may have been small, or the foods were consumed by few persons or in small amounts and therefore did not influence population nutrient intake distributions substantially. Specifically, a considerable decrease in the amount of vitamin C was found for children in our study, and we speculate that the reason was the deletion of the facet 'Enriched/fortified' in the food group 'Non-alcoholic beverages'. According to the report of the 2007-2010 survey, 'Non-alcoholic beverages' and 'Meat and meat products' together contribute one third of the total vitamin C intake, partly due to food fortification and processing [24]. Hence, fortified beverages were probably linked to NEVO codes for products without fortification, which resulted in a lower vitamin C content. On the other hand, a large increase in the amount of B vitamins was found for children. A possible explanation is the deletion of 'Flavoured component' in the food group 'Cereal', which may have caused flavoured cereals to be linked to cereals without flavours (i.e., whole wheat cereals), which normally have higher vitamin B contents. A closer investigation should be conducted before deleting facets in a real setting.
To our knowledge, this is the first study investigating the impact of reducing food descriptions in interview-based 24HRs on the estimation of population nutrient intake distributions. Until now, decisions on the facets that were included in the 24HR interview of DNFCS were based on expert judgment. A strength of our approach is that both the evaluation of facet importance and the assessment of its impact were data-driven. Another strength is the use of random forest for the identification of unimportant facets. This prediction model is more efficient in large datasets, has a lower risk of overfitting and is better at dealing with correlated predictors than multiple linear regression [30]. However, the applied random forest implementation only allows nominal variables with a limited number of levels as predictors. Therefore, the nominal variable "food ID" was treated as a continuous variable, and the importance of the information on the full brand and product name of each food could not be evaluated. Also, the importance of the facet "Cooking method" could not be fully assessed, since the fat added during frying was not included in the nutrient content of the food but became a separate food item in the food consumption database. Another limitation of our study was the use of a semi-automated protocol for reassigning a different NEVO code to combinations with deleted facets rather than applying the original approach of 'manual' linkage by dieticians. However, manual matching would only have further decreased the effect of facet deletion, so we do not think our conclusions would have been different. Finally, the impact of facet reduction on respondents' answers during the food description part of the interview was not assessed. Although a face-to-face or telephone 24HR interview generally has a smaller self-reporting error than other methods, measurement error still exists (e.g., reliance on memory, underreporting) [6]. However, we assume that the effect of facet reduction on self-reporting error will be small.
Our analyses focused on the nutritional aspects of deleting facets, while other aspects can be important as well. One example is the facet 'Physical state', which is essential in quantifying the consumed foods, e.g., coffee powder is quantified differently than coffee as a beverage. Moreover, deleting facets that could be used to estimate exposure to potentially harmful substances should be carefully considered for practical use. For example, facets related to food preparation should be kept for some foods like meats, since preparation is a crucial food characteristic for identifying microbiological risks. In principle, the procedure described in this manuscript can also be applied to evaluate facet importance for food chemical distributions. The facets and descriptors of the GloboDiet software can be tailored for any new study [17]. Researchers using this software should thus consider carefully which food characteristics are important for their study aims before the start of a study.
The objective of examining the reduction in food characteristics was to enhance efficiency in conducting future surveys. A less extensive food description would result in a shorter interview duration and less work in linking the foods to the food composition database. The time needed to go through the facets for all consumed foods was estimated to be 15 min out of a 44 min 24HR interview. Without the 35% of facets identified as unimportant, the time saved for one interview would on average be 5 min. In a survey with 3819 participants who are interviewed twice, a total of 637 h would be saved. Moreover, a less extensive food description during data collection would lead to fewer unique food-descriptor combinations reported in a survey. In the data handling phase, each unique food-descriptor combination needs to be linked manually to the food composition database, which costs approximately 5 to 10 min. Hence, a reduction of 3534 unique combinations (from 26,679 to 23,145) after deleting less important facets would save around 442 h. To sum up, we estimated that around 1079 h would be saved for data collection and handling together if facet deletion were applied.
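The time estimates above can be retraced with a short calculation. The figures come from the text; the use of 7.5 min per combination (the midpoint of the stated 5 to 10 min) and rounding half up are assumptions that happen to reproduce the reported totals.

```python
def round_half_up(value):
    # Plain half-up rounding; Python's round() rounds halves to even.
    return int(value + 0.5)

FACET_MINUTES_PER_INTERVIEW = 15   # time spent on facets per 24HR
SHARE_OF_FACETS_DELETED = 0.35     # facets identified as unimportant
PARTICIPANTS = 3819
RECALLS_PER_PARTICIPANT = 2
LINKING_MINUTES = 7.5              # assumed midpoint of 5 to 10 min

minutes_saved_per_interview = round_half_up(
    FACET_MINUTES_PER_INTERVIEW * SHARE_OF_FACETS_DELETED)
interview_hours = round_half_up(
    PARTICIPANTS * RECALLS_PER_PARTICIPANT * minutes_saved_per_interview / 60)

combinations_removed = 26679 - 23145
linking_hours = round_half_up(combinations_removed * LINKING_MINUTES / 60)

total_hours = interview_hours + linking_hours
print(minutes_saved_per_interview, interview_hours, linking_hours, total_hours)
```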
The current study focused on reducing the number of facets as a potential efficiency measure for a national food consumption survey. Alternative efficiency options have been studied elsewhere. One alternative is to use 24-h dietary recall software to guide the interviews in which the food list is directly related to the foods in the national food composition database [9,31]. GloboDiet did not choose this option in order to give flexibility for new foods that have entered the food market (but have not yet been included in food composition databases), to standardize food description across different countries that use the same software, and to be able to collect characteristics of food relevant for purposes other than nutrient intake estimation [17]. A more cost-efficient alternative for dietary assessment is to use self-administered methods. However, the accuracy and reliability of those tools need to be further evaluated, due to self-reporting errors and varying levels of acceptance by different age groups [32]. Furthermore, matching food consumption and food composition data could be made more efficient through automatic or semi-automatic linkage. In this study, decisions on NEVO code reassignment for food-descriptor combinations were made with a simple algorithm based on the results of the random forest algorithm. For matching future food consumption data automatically or semi-automatically, random forest prediction models could be developed that use previously matched food consumption and food composition data as the training dataset. Similar approaches have been developed in some studies, including a semi-automatic food matching technique using machine learning and a natural language processing approach. These approaches show promise for replacing manual linkage between foods and food composition databases [33,34].

Conclusion
In conclusion, the data-driven procedure that combined random forest prediction with a simulation study was successful in identifying less important characteristics of food description. After deleting those less important characteristics, there was little impact on the population nutrient intake distributions for most nutrients, thus yielding a promising approach for saving labour and costs.