Early dietitian referral in lung cancer: use of machine learning
  1. Michael Chung1,
  2. Iain Phillips2,
  3. Lindsey Allan3,
  4. Naomi Westran3,
  5. Adele Hug3 and
  6. Philip M Evans1,4
  1. 1CVSSP, University of Surrey, Guildford, UK
  2. 2Edinburgh Cancer Centre, Western General Hospital, Edinburgh, UK
  3. 3Department of Nutrition and Dietetics, Royal Surrey County Hospital NHS Foundation Trust, Guildford, UK
  4. 4Chemical, Medical and Environmental Science, National Physical Laboratory, Teddington, UK
  1. Correspondence to Dr Iain Phillips, Edinburgh Cancer Centre, Western General Hospital, Edinburgh EH4 2XU, UK; iain.phillips{at}


Objectives The Dietetic Assessment and Intervention in Lung Cancer (DAIL) study was an observational cohort study. It triaged the need for dietetic input in patients with lung cancer, using questionnaires with 137 responses. This substudy tested if machine learning could predict need to see a dietitian (NTSD) using 5 or 10 measures.

Methods 76 cases from DAIL were included (Royal Surrey NHS Foundation Trust, RSH: 56; Frimley Park Hospital, FPH: 20). Univariate analysis was used to find the strongest correlates with NTSD and ‘critical need to see a dietitian’ (CNTSD). Those with a Spearman correlation of magnitude above 0.4 were selected to train a support vector machine (SVM) to predict NTSD and CNTSD. The 10 and 5 best correlates were also evaluated.

Results 18 and 13 measures had a correlation of magnitude above 0.4 for NTSD and CNTSD, respectively, producing SVMs with 3% and 7% misclassification error. 10 measures yielded errors of 7% (NTSD) and 9% (CNTSD). 5 measures yielded between 7% and 11% errors. SVM trained on the RSH data and tested on the FPH data resulted in errors of 20%.

Conclusions Machine learning can predict NTSD producing misclassification errors <10%. With further work, this methodology allows integrated early referral to a dietitian independently of a healthcare professional.

  • cachexia
  • lung
  • symptoms and symptom management

Summary box

What was already known?

  • Machine learning often requires more than 100 measures to be effective in classifying outcomes

  • It has not previously been applied to dietetic screening.

What are the new findings?

  • Machine learning can be used to predict need to see a dietitian for patients with lung cancer.

  • The machine learning approach used 5–10 measures as opposed to over 100, with a high level of accuracy (>90%).

What is their significance?

  • Clinical: it establishes the principle that an assessment for need for dietetic input can happen independently of the patient and staff, which allows for more timely referrals.

  • Research: the software could be made freely available and integrated into practice at minimal cost.


Lung cancer is the leading cause of cancer death in the UK and worldwide. The introduction of new therapies brings the potential for lung cancer outcomes to improve with increased duration and intensity of treatment; therefore, patients need to be fitter to complete treatment.1–3 Cancer-associated malnutrition has resulted in greater challenges when delivering chemotherapy.4–7 Assessing nutritional status early is, therefore, vital in patients with advanced lung cancer.8 9

European guidelines suggest that all patients should be screened for malnutrition at diagnosis.10 Tools such as the Patient Generated Subjective Global Assessment (PG-SGA) and the Malnutrition Universal Screening Tool have identified high rates or risk of malnutrition in these patients, but require time and skill to interpret.11 12 Available standardised tools are unwieldy and time-consuming for mass screening in busy, stretched clinical settings.

The Dietetic Assessment and Intervention in Lung Cancer (DAIL) trial was a cohort study, in which 96 patients with stage IIIB and IV lung cancer were recruited before starting systemic therapy. Participants underwent multiple assessments. Need to see a dietitian was determined by the PG-SGA, since it incorporates symptom screening, weight change, activity level and food intake.13 Two per cent of patients did not require dietetic intervention, 20% required education, 78% required dietitian review, of which 52% had critical need for dietetic intervention.8

This paper presents a machine learning approach to test an automatic way of screening for the need to see a dietitian (NTSD), in particular whether a small number of easily acquired pieces of information could help identify patients who required dietetic input. These data could form part of the referral for investigation of suspected cancer, thereby ensuring early signposting to a dietitian. It was determined that a clinically useful tool should achieve 90% accuracy using as few as 10 or 5 questions.

The machine learning approach adopted for this study was a support vector machine (SVM). This method trains a computer to split the patient group into those who need to see a dietitian and those who do not, based on the values of a set of input measures (and knowledge of whether each patient needed to see a dietitian). The trained computer is then used to predict NTSD for future cases. With this approach, a subset of input information can be tested to determine whether this subset is sufficient to predict NTSD with a desired accuracy.


Study participants

Complete data sets from patients consented to the DAIL trial between April 2017 and June 2019 were used for this analysis. Ethical approval for DAIL was obtained from the Camberwell St Giles Research Ethics Committee, London (reference 16/LO/2143). The study protocol is included as supplementary materials. The following assessments were complete for each participant: PG-SGA nutrition screening tool, EORTC (European Organisation for Research and Treatment of Cancer) quality of life surveys QLQ (Quality of Life Questionnaire) C30 and LC 13, G8 frailty assessment, hand grip strength, spirometry, neutrophil count, albumin and psoas muscle surface area to assess for sarcopenia. The proportion of patients who fitted the criteria for cachexia was also calculated.14 15 This resulted in 137 data points per individual patient. The primary outcome of the DAIL trial was ‘NTSD’ (defined as PG-SGA score 4–8) and ‘critical need to see a dietitian’ (CNTSD, PG-SGA score ≥9). A univariate analysis of all the 137 variable fields of information (excluding the PG-SGA) was performed. The Spearman rank correlation was calculated between each variable and both NTSD and CNTSD.
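
The univariate screen described above can be sketched in a few lines. This is a minimal pure-Python illustration, not the code used in the study: the function names (`average_ranks`, `spearman`, `select_measures`) and the dictionary layout for the measures are assumptions made for the example. Tied values receive the mean of their rank positions, and a constant measure is given a correlation of 0 by convention here.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0 or sy == 0:
        return 0.0  # assumption: a constant measure is treated as uncorrelated
    return cov / (sx * sy)


def select_measures(measures, outcome, threshold=0.4):
    """Keep measures whose |rho| with the binary outcome exceeds the threshold.

    measures: dict mapping a measure name to a list of per-patient values.
    outcome:  list of 0/1 labels (for example NTSD or CNTSD) in the same order.
    """
    rho = {name: spearman(vals, outcome) for name, vals in measures.items()}
    return {name: r for name, r in rho.items() if abs(r) > threshold}
```

Calling `select_measures` once with the NTSD labels and once with the CNTSD labels would give the two candidate input lists.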

These results were used to select input data for the SVM models. In SVM1, all markers with a correlation coefficient of magnitude above 0.4 were included. The threshold of 0.4 was a pragmatic choice, to establish 10–20 inputs for the SVM model. Subsets of 10 (SVM2) and 5 (SVM3a) measures with the strongest correlation were identified and used in subsequent models. Finally, a set of five biomarkers expected to be complementary and correlate with the NTSD were selected (SVM3b) by authors (IP, NW, AH and LA) before the analysis. These were ‘EORTC: appetite loss’, ‘EORTC: fatigue: were you tired?’, ‘G8: food intake’, ‘G8: weight loss’ and ‘Cachexia: percentage weight loss’. The SVM model was trained on both NTSD and CNTSD.
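
The way the input sets for SVM1, SVM2 and SVM3a follow from the correlation table can be sketched as below. This is an illustrative sketch only: `build_input_sets` and its key names are hypothetical, and the clinician-selected set (SVM3b) is not derived this way because it was chosen by the authors before the analysis.

```python
def build_input_sets(correlations, threshold=0.4, subset_sizes=(10, 5)):
    """Derive candidate input sets from a {measure name: rho} mapping.

    Returns the measures with |rho| above the threshold (the SVM1-style set)
    and the top-k measures by |rho| for each k in subset_sizes.
    """
    # Rank measures by strength of correlation, ignoring the sign.
    ranked = sorted(correlations.items(), key=lambda item: abs(item[1]), reverse=True)
    input_sets = {
        "all_above_threshold": [name for name, rho in ranked if abs(rho) > threshold]
    }
    for k in subset_sizes:
        input_sets[f"top_{k}"] = [name for name, _ in ranked[:k]]
    return input_sets
```

Ranking by magnitude rather than by signed value matters because a strongly negative correlate is as informative to the classifier as a strongly positive one.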

The performance of the machine learning models was evaluated using the misclassification error rate, which is calculated by putting the data used to train the model back into the model and measuring the success rate in predicting the outcome. The misclassification error is the fraction of cases that are wrongly classified.
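
As a concrete reading of this definition, the misclassification error is the number of wrong predictions divided by the number of cases. The helper below is a minimal sketch with a hypothetical name, not the study's code.

```python
def misclassification_error(y_true, y_pred):
    """Fraction of cases whose predicted label differs from the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one case is required")
    wrong = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return wrong / len(y_true)
```

When the predictions come from the same cases that were used for training, as described above, the resulting error tends to be optimistic compared with the error on unseen patients.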

Two methods were used to evaluate the accuracy of the SVM approach. The first combined all data sets and evaluated performance using the misclassification error. The second trained the SVM using the Royal Surrey NHS Foundation Trust (RSH) data and evaluated the misclassification error on the Frimley Park Hospital (FPH) data. The SVM model used in this work was the function fitcsvm in Matlab with the standard linear kernel.
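
The second evaluation, training on one site and testing on the other, can be illustrated with a small self-contained sketch. Note the substitution: the study used Matlab's fitcsvm, whereas the code below fits a linear soft-margin SVM by plain batch subgradient descent on the regularised hinge loss, which is a simpler stand-in and not the solver fitcsvm uses. All names, the hyperparameters (`lam`, `lr`, `epochs`) and the label coding (+1 for needing a dietitian, -1 otherwise) are assumptions for illustration.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Fit w, b by batch subgradient descent on the regularised hinge loss.

    X: list of feature vectors (lists of floats); y: labels in {-1, +1}.
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]  # gradient of the L2 penalty
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # only margin violators contribute to the hinge term
                for j in range(d):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, X):
    """Label each case by the side of the hyperplane it falls on."""
    return [1 if sum(wj * xj for wj, xj in zip(w, xi)) + b >= 0 else -1 for xi in X]


def misclassification_error(y_true, y_pred):
    """Fraction of cases whose predicted label differs from the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t != p) / len(y_true)


def cross_site_error(X_train, y_train, X_test, y_test):
    """Train on one site's data and report the error on the other site's data."""
    w, b = train_linear_svm(X_train, y_train)
    return misclassification_error(y_test, predict(w, b, X_test))
```

Passing the same data as both training and test set reproduces the first evaluation method (error on the training data), while passing RSH-style data for training and FPH-style data for testing reproduces the second.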


Study population

76 complete data sets were available from the DAIL trial (RSH: 56, FPH: 20). Of the RSH group, 48 (86%) scored >4 on PG-SGA (NTSD) and 32 (57%) scored >9 (CNTSD). Of the FPH group, 14 (70%) scored >4 and 8 (40%) scored >9.
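
The percentages quoted for each site follow directly from the counts, as the short check below shows. The pooled figures under the key "ALL" are derived here from the same counts and are not stated in the text; the function and key names are hypothetical.

```python
def percentage(count, total):
    """Return count as a percentage of total."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * count / total


# Counts as reported for each site: cohort size, NTSD cases, CNTSD cases.
site_counts = {
    "RSH": {"n": 56, "ntsd": 48, "cntsd": 32},
    "FPH": {"n": 20, "ntsd": 14, "cntsd": 8},
}


def summarise(counts):
    """Per-site and pooled percentages of NTSD and CNTSD cases."""
    summary = {}
    for site, c in counts.items():
        summary[site] = {
            "ntsd_pct": percentage(c["ntsd"], c["n"]),
            "cntsd_pct": percentage(c["cntsd"], c["n"]),
        }
    # Pooled figures are derived here, not quoted from the article.
    n_all = sum(c["n"] for c in counts.values())
    summary["ALL"] = {
        "ntsd_pct": percentage(sum(c["ntsd"] for c in counts.values()), n_all),
        "cntsd_pct": percentage(sum(c["cntsd"] for c in counts.values()), n_all),
    }
    return summary
```

Rounding the per-site values to whole numbers gives the figures in the text.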

The univariate analysis showed that 18 and 13 variables for NTSD and CNTSD, respectively, had a Spearman’s correlation of magnitude 0.4 or above. SVM1 results using these variables showed a misclassification error of under 10% in all cases (3% for NTSD and 7% for CNTSD).

The 10 best correlates with NTSD and CNTSD from the univariate analysis were: (1) EORTC: appetite loss, (2) EORTC: fatigue, (3) G8: food intake, (4) G8: weight loss, (5) Cachexia: % weight loss, (6) need to stay in bed or chair during the day, (7) limited in doing either work or other daily activities, (8) overall health, (9) overall quality of life, (10) Nausea. These 10 best correlates were the same for both hospitals and for both NTSD and CNTSD. The top five correlates differed slightly between the two trusts with (1), (3), (4) and (5) being in the top five for both trusts and for both NTSD and CNTSD.

Table 1 shows the results of the various SVM models. For the rows labelled ‘ALL’, the model uses data from both RSH and FPH, and misclassification error is calculated for both hospitals. For RSH and FPH models, each hospital is modelled separately. Table 1 is limited by the small number of test points in the FPH data (n=20) but shows, for NTSD, promise that a model with 5–10 parameters and a better than 10% misclassification rate may be achievable.

Table 1

Misclassification error for all SVM models.

Discussion and conclusions

SVM has been employed to identify patients who need to see a dietitian, using 5 or 10 selected measurements to achieve classification accuracy of 90%. Detecting malnutrition early in the patient pathway may result in improved outcomes and tolerance to treatment. Its automation overcomes the issue of understanding and interpreting malnutrition screening tools. Further studies with larger cohorts are required at different points in the referral pathway to challenge this hypothesis.

Using five measures in patients from RSH, only one patient was misclassified. Patients in need of dietetic input would be identified early and undergo a thorough assessment by a dietitian. This would enable all patients to be screened for malnutrition, increasing compliance with national and European guidance. During the COVID-19 pandemic, many consultations are being done at a distance. An automated independent assessment process would allow quicker and more efficient referral to a dietitian.

A limitation is that the SVM model gives a prediction but does not result in a probability. As a rule of thumb, larger data sets tend to have better outcomes in machine learning approaches. Uncertainty is greatest in those who are borderline for the NTSD. Those who are very likely to require dietetic input or those who definitely do not are the least likely to be misclassified.
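
One way to make the idea of a borderline case concrete, offered only as an illustration and not as part of the study, is to look at the raw decision score of a linear SVM rather than the predicted label alone. In the standard soft-margin formulation, cases with a score of magnitude below 1 lie inside the margin, close to the decision boundary. The function names, the weights `w` and `b`, and the choice to flag such cases for manual review are all assumptions.

```python
def decision_score(w, b, x):
    """Raw linear SVM score w.x + b; the sign gives the predicted class."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def classify_with_margin_flag(w, b, x, margin=1.0):
    """Return (label, score, borderline).

    borderline is True when the case falls inside the margin, that is,
    when |score| < margin, so the prediction is less clear-cut.
    """
    score = decision_score(w, b, x)
    label = 1 if score >= 0 else -1
    return label, score, abs(score) < margin
```

A score is not a probability, so this does not remove the limitation; it only separates clear-cut predictions from those near the boundary.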

Ethics statements

Patient consent for publication

Ethics approval

The DAIL trial received formal ethical approval from Camberwell St Giles Research Ethics Committee, London (reference 16/LO/2143).


We acknowledge the contribution of the patient representatives who gave feedback on the DAIL trial, at the 2016 NCRI conference dragons’ den event. Dr Phillips is supported by an NHS Research Scotland Career Researcher Fellowship and acknowledges Cancer Research UK part funding of the Edinburgh Cancer Research Centre.



  • Twitter @LindseyAllan6

  • Contributors Study concept and planning: IP, LA and PME. Analysis of data: MC. Interpretation of data: all authors. Contribution and critical review of journal article: all authors.

  • Funding The DAIL study was funded by Chugai Pharma UK Ltd.

  • Competing interests None declared.

  • Provenance and peer review Not commissioned; internally peer reviewed.