Natural language processing and network analysis provide novel insights on policy and scientific discourse around Sustainable Development Goals

The United Nations’ (UN) Sustainable Development Goals (SDGs) are heterogeneous and interdependent, comprising 169 targets and 231 indicators of sustainable development in such diverse areas as health, the environment, and human rights. Existing efforts to map relationships among SDGs are either theoretical investigations of sustainability concepts, or empirical analyses of development indicators and policy simulations. We present an alternative approach, which describes and quantifies the complex network of SDG interdependencies by applying computational methods to policy and scientific documents. Methods of Natural Language Processing are used to measure overlaps in international policy discourse around SDGs, as represented by the corpus of all existing UN progress reports about each goal (N = 85 reports). We then examine if SDG interdependencies emerging from UN discourse are reflected in patterns of integration and collaboration in SDG-related science, by analyzing data on all scientific articles addressing relevant SDGs in the past two decades (N = 779,901 articles). Results identify a strong discursive divide between environmental goals and all other SDGs, and unexpected interdependencies between SDGs in different areas. While UN discourse partially aligns with integration patterns in SDG-related science, important differences are also observed between priorities emerging in UN and global scientific discourse. We discuss implications and insights for scientific research and policy on sustainable development after COVID-19.

Since their conception at the Rio+20 conference, the United Nations’ Sustainable Development Goals (SDGs) were intended to be both independent, each addressing a distinct sustainability domain, and interconnected, reflecting the many potential interdependencies, synergies and trade-offs between the 17 goals (Table 1) and their 169 constituent targets1. These interdependencies are diverse, complex, and potentially varying in space and time. Mapping and understanding them is essential to achieve policy coherence in sustainable development2,3,4, harmonizing and integrating policy action in different sustainability sectors to maximize mutual reinforcements and minimize conflicting outcomes4,5,6,7,8. The study of interlinkages between SDGs has garnered increasing attention among sustainability scholars in recent years9. At the same time, there have been growing calls for a reorganization and prioritization of policy and scientific efforts on sustainable development around a few, coherent sets or clusters of integrated SDGs3,10,11, such as the six ‘entry points’ of the most recent UN Global Sustainable Development Report12.

Table 1. Sustainable development goals: abbreviation and full descriptions.

However, the extant scholarship about SDG interactions and clusters has important limitations, typically relying on manual coding of SDG descriptive texts, qualitative reviews of scientific documents and expert opinions, or observed covariance in SDG indicator data. Building on and complementing this literature, the present study proposes novel empirical methods for identifying SDG interconnections by applying Natural Language Processing (NLP) tools and network science techniques to official UN progress reports and scientific publications around SDGs. We seek to address two main questions: (i) What interconnections and clusters of interrelated SDGs emerge from official UN policy discourse around sustainable development?
(ii) Are similar interconnections also found in scientific discourse about the SDGs?

At the intersection of linguistics, computer science, and machine learning, NLP techniques can be used to detect key words and phrases, extract topics and, crucially for this study, quantify overlaps and relationships between texts13,14,15. These relationships can be analyzed as networks, uncovering complex structures of proximity and distance between documents. In the first part of the article, we use this combination of methods to (i) describe and summarize international policy discourse around SDGs, as represented by official UN reports about the progress made yearly towards each goal; (ii) map networks of SDG interconnections that emerge from UN policy discourse; and (iii) identify clusters of SDGs which the UN tends to discuss similarly, implying they could be addressed simultaneously by policymakers or scientists.

The first wave of scholarship about SDG interdependences was dominated by theoretical classifications of SDGs and their relationships. Attempts to identify clusters of SDGs pertaining to similar domains16,17 tended to group them by focus on the natural environment (SDGs 6, 12, 13, 14, 15); on basic human needs or well-being (SDGs 1, 2, 3, 4, 5, 10, 16); and on services, prosperity, and economy (SDGs 2, 6, 7, 8, 9, 10, 11, 12). More granular theoretical classification schemes exist, dividing these broader categories into sub-areas, such as sustainable resource use and earth preconditions within the ecological domain18. Such models are limited by their exclusive attention to SDG classification and clustering, ignoring pairwise relationships between goals and the complex networks they form.

Networks of SDG interdependencies were the subject of a second wave of research. A seminal study in this area quantified the degree of synergy between SDGs by examining targets that are shared or cross-referenced in their official descriptions, targets, and indicators6.
However, this effort was criticized for failing to account for trade-offs and areas of potential conflict between goals5,19. A well-known framework developed by the International Council for Science (ICSU) indexed interlinkages between SDGs on a 7-point scale ranging from ‘cancelling’ to ‘indivisible’7,20. This approach offers a more comprehensive means of describing SDG synergies and trade-offs, and has since been widely used to quantify the strength and direction of the relationships in SDG networks2,3,21. This work has mostly relied on qualitative appraisal of literature by experts, or on qualitative coding and evaluation of SDG descriptive texts. However, in the absence of quantitative and reproducible methods to measure SDG interlinkages, it is difficult to know how far the relationships described in this body of research are influenced by the professional backgrounds and subjective views of experts and coders, rather than reflecting objective characteristics of policy, scientific discourse, or social and environmental phenomena relevant to the SDGs.

A third, growing wave of scholarship has begun assessing interdependencies between SDGs based on empirical data on the covariance between SDG indicators4,22,23,24,25,26. These studies, often based on system dynamics, have used longitudinal indexes of sustainable development progress to identify SDGs on which advances are made concurrently, and those whose indicators trend in opposite directions23,24,25,26. Efforts are also being made to collect data for the empirical application of the ICSU framework, but this literature is still in its infancy8,27. While this scholarship provides crucial insights about SDG interlinkages, it requires costly production and monitoring of goal indicators and is hindered by the scarcity of indicator data for certain SDGs and targets4,21,22,28.
Further, this type of work focuses on simultaneous covariance in SDG indices, failing to capture lagged interdependencies such as those between certain social and environmental indicators3.

Together with policy coherence, scientific integration around SDGs has emerged as a central challenge in recent years. Scientific evidence and knowledge relevant to SDGs are dispersed and siloed across different disciplines, institutions, geographical scales, and locations, a fragmentation that creates a critical impediment to the advancement of science and policy on sustainable development3,10,29,30. In particular, natural science research has been criticized for failing to incorporate insights from the social sciences and environmental humanities, resulting in limited ability to translate scientific findings into positive change31. In the second part of this article, we turn attention to scientific integration around SDGs and ask whether global scientific debate agrees with international policy discourse on areas of SDG interconnection. We address this question by examining the extent to which the strongest SDG interdependencies emerging from UN official documents are reflected in scientific integration within sustainable development research. When topic combination and collaboration in SDG-related science align with the SDG interconnections observed in global policy discourse, we can also conclude that there is stronger evidence in support of the identified interdependencies between SDGs. Recent scientometric scholarship has investigated scientific collaboration and diversity in research teams in terms of interest areas, disciplines, geography and organizational affiliations, and their effects on scientific production32,33,34,35.
We draw on this literature to examine collaboration and diversity in scientific articles classified by SDG relevance.

To the best of our knowledge, this is the first study to apply NLP methods (in combination with network analysis) to the problem of uncovering SDG interconnections9. Importantly, unlike prior research based on expert elicitation, we do not rely on our own (or others’) appraisal of the intensity and direction of SDG interactions. Moreover, unlike studies of indicator covariance, our method is not constrained by the scarcity of indicator data or limited to analysis of simultaneous SDG interactions. Supplementing existing approaches and overcoming some of their limitations, the methods presented here provide a new lens for the study of SDG interactions and could become integral to ongoing efforts to map SDG interlinkages and clusters, and advance both policy coherence and scientific integration on sustainable development.

We analyzed the corpus of all ‘Progress and Information’ (P/I) reports presented by the UN Economic and Social Council (ECOSOC) to the UN Secretary General about each SDG each year, from 2016 (the first year such reports were produced) to 2020 (the year of the last available reports)36. The reports describe annual global progress made toward each SDG and provide other descriptive information about each goal. All 85 reports (17 SDGs, 5 years) were scraped and cleaned with standard text pre-processing procedures and concatenated by SDG. The length of the resulting 17 P/I documents ranged from 1662 to 6743 tokens (mean = 3092, sd = 1363 tokens). To distill salient information, we applied standard NLP text preprocessing steps, including tokenization, lemmatization, and part-of-speech (POS) tagging. To describe the corpus content we applied the TextRank algorithm to the lemmatized contents of the documents13,37.
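
The keyword-ranking step can be illustrated with a minimal TextRank-style sketch. It assumes the input is already tokenized, lemmatized, and POS-filtered; the function name `textrank_keywords`, the co-occurrence window of 2, and the damping factor of 0.85 are illustrative assumptions, not settings reported by the authors.

```python
from collections import defaultdict


def textrank_keywords(tokens, window=2, damping=0.85, iterations=50):
    """Rank lemmas with a TextRank-style PageRank over a co-occurrence graph.

    `tokens` is assumed to be a list of already lemmatized, POS-filtered words.
    """
    # Build an undirected, unweighted graph: two lemmas are linked when they
    # appear within `window` positions of each other.
    neighbors = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + window + 1]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)

    # Iterate the PageRank update: each node shares its score equally
    # among its neighbors, plus a constant (1 - damping) baseline.
    scores = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        scores = {
            word: (1 - damping)
            + damping * sum(scores[n] / len(neighbors[n]) for n in neighbors[word])
            for word in neighbors
        }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy input standing in for one lemmatized P/I document.
example_tokens = ["water", "sanitation", "access", "water", "quality", "water", "scarcity"]
for lemma, score in textrank_keywords(example_tokens):
    print(f"{lemma}: {score:.3f}")
```

Lemmas that co-occur with many other well-connected lemmas receive the highest scores, which is why such a ranking can serve as a compact description of each document's content.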
Further details about text preprocessing and results of TextRank analysis are presented in the SI Appendix.

To map the interdependencies between the SDGs we required distributed numeric representations for which we could calculate similarity measures. This was achieved using a document embedding model (doc2vec), a technique used in NLP to generate numeric representations of documents after inheriting word semantics based on their use in similar lexical contexts in a corpus of training data15. A continuous bag-of-words (CBOW) doc2vec model (5-word window) with 300 dimensions and 250 training iterations was then used to quantify semantic overlap between the P/I documents38. The doc2vec model introduces a document ID parameter into a CBOW word2vec model (a shallow neural network designed to generate word embeddings based on word collocation) to predict words based on the document identifier and broader lexical context. This information is used to generate document embeddings (i.e., numeric vector representations) for each SDG’s P/I document15. Cosine similarity between document embeddings captures discourse proximity or overlap between SDGs in UN reports. These similarities are represented as a weighted network of SDGs, with the weight of each network edge indicating the normalized cosine similarity between UN reports on two SDGs. To identify clusters of similar SDGs, complete-link cluster analysis was conducted on the matrix of cosine similarity scores between SDGs, and the Louvain community-detection algorithm was applied to the network of SDG similarities39,40. When switching from a model trained on the UN P/I corpus (Fig. 1) to a model trained on 11.8 GB of US news reports (Figure S4), the average cosine similarity increased to 0.94 (sd = 0.02) but there were no substantive differences in patterns of discursive similarity between SDGs.
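
The similarity and clustering steps described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions: the three-dimensional vectors in `toy_embeddings` are invented stand-ins for 300-dimensional doc2vec embeddings, all function names are hypothetical, the edge weights are raw cosine similarities (the normalization used in the article is not specified here), and Louvain community detection is not reproduced.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def similarity_matrix(embeddings):
    """Return labels and the full pairwise cosine similarity matrix."""
    labels = list(embeddings)
    sim = [[cosine_similarity(embeddings[a], embeddings[b]) for b in labels] for a in labels]
    return labels, sim


def weighted_edges(labels, sim):
    """One weighted edge per unordered pair of documents."""
    return {
        (labels[i], labels[j]): sim[i][j]
        for i in range(len(labels))
        for j in range(i + 1, len(labels))
    }


def complete_link_clusters(labels, sim, n_clusters):
    """Agglomerative clustering with complete linkage.

    The distance between two clusters is the largest pairwise distance
    (1 - cosine similarity) between their members.
    """
    clusters = [[i] for i in range(len(labels))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                dist = max(1 - sim[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or dist < best[0]:
                    best = (dist, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge the closest pair
        del clusters[b]
    return [sorted(labels[i] for i in cluster) for cluster in clusters]


# Invented toy vectors; real inputs would be the doc2vec document embeddings.
toy_embeddings = {
    "SDG13": [0.90, 0.10, 0.00],
    "SDG14": [0.80, 0.20, 0.10],
    "SDG15": [0.85, 0.15, 0.05],
    "SDG1": [0.10, 0.90, 0.20],
    "SDG2": [0.20, 0.80, 0.30],
}

labels, sim = similarity_matrix(toy_embeddings)
print(complete_link_clusters(labels, sim, n_clusters=2))
```

Complete linkage merges two groups only when even their most dissimilar members are close, so it tends to produce compact clusters. In practice a library routine for hierarchical clustering and a graph library for community detection would replace these hand-written helpers.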
Results on discursive similarities are also robust to different specifications of doc2vec models and to other NLP methods such as Latent Semantic Analysis (see SI Appendix).

Figure 1. Discursive overlap between SDGs based on cosine similarity in doc2vec embeddings. (A) Heatmap of cosine similarity matrix with dendrogram of matrix hierarchical clustering. (B) Weighted network of cosine similarities (node colors represent subgroups identified via the Louvain network community detection algorithm40).

To assess the alignment between overlap in SDG UN reports and integration between
https://www.nature.com/articles/s41598-021-01801-6