
Visualizing texts: a tool for generating thematic-progression diagrams


The Hallidayan notion of Theme, first presented in a series of seminal papers in the 1960s (Halliday 1967, 1968), is a central component of the message structure of the clause, and forms part of a larger social-functional theory of language. The theory regards the functions of language as primary and accounts for how language both acts upon and is constrained by the social context in which it functions (Halliday 1973; Halliday and Hasan 1985). These contexts are stated in terms of language metafunctions—the ideational (comprising experiential and logical), interpersonal, and textual. While the ideational and interpersonal metafunctions construe our experiences of the world and establish interpersonal relations between or among the discourse participants, the textual metafunction packages such experiences and interpersonal relations so that they can be meaningfully conveyed through language. The textual metafunction, to which Theme belongs, thus performs a facilitating or enabling role, allowing for the message in the clause, and by extension the larger text, to be ordered and developed.

In the Hallidayan framework, Theme is glossed as “the point of departure” of the clausal message (Halliday and Matthiessen 2014, 83). Specifically, “[o]ne part of the clause is enunciated as the Theme; this then combines with the remainder so that the two parts together constitute a message” (Halliday and Matthiessen 2014, 88). In English, clause-initial grammatical constituents are categorized as textual, interpersonal, or topical Themes, as listed in Table 1. These thematic elements correspond to the textual, interpersonal, and experiential metafunctions of language.

Table 1 Textual, interpersonal, and topical Themes in the Hallidayan framework

The topical Theme is the most important of the three Theme types; it comprises only one experiential element and ends the thematic portion of the clause. Halliday and Matthiessen (2014, 111–112) argue that unless this constituent appears, “the clause lacks an anchorage in the realm of experience”. The thematic portion, therefore, extends from the beginning of the clause up to and including the topical Theme; the remainder of the clause is known as Rheme. The topical Theme need not be preceded by textual or interpersonal Themes, which are optional.

Recent developments in text-based studies

The Hallidayan Theme-Rheme framework has been extensively applied in corpus-based studies over the years, offering valuable insights into the use and choice of Themes in news articles (Liu and Tucker 2015; Rahnemoon et al. 2017), student writing (Alyousef 2016; North 2005), research articles (Gosden 1992), and sales-promotion communication (Cheung 2011), among others.

Most recently, studies have attempted to capture the progression of topical Themes through the text to arrive at a thematic diagram of the text as a whole (e.g. Leong 2015, 2016; Leong et al. 2018). The idea of thematic progression (TP) was proposed by Daneš (1970, 1974), who first drew attention to the patterned inter-relationships between and among Themes and Rhemes in the text. He identified various broad patterns, two of which are shown in (1–2) and Figs. 1 and 2. Clause boundaries are indicated by |||, and embedded clauses are enclosed within double square brackets [[…]]. Topical Themes are in bold typeface. Text examples in this report are taken from the extended abstract of a research article on Saturn’s moons (Buratti et al. 2019); the article was published in 2019 in the journal Science, and was the most recent article at the time of preparing this report. The extended abstract is reproduced in full in the Appendix.

(1) ||| Saturn’s main ring system is associated with a family of small moons. ||| Pan and Daphnis orbit within the A-ring’s Encke Gap and Keeler Gap, respectively, ||| […]

(2) ||| Saturn’s ring moons record a complex geologic history with groove formation [[caused by tidal stresses and accretion of ring particles.]] ||| The moons [[embedded within the rings or near their edges]] have solid cores with equatorial ridges of more weakly consolidated material. |||

Fig. 1 Simple-linear TP

Fig. 2 Constant TP

In reality, it is far more usual for a text to display a combination of these and other patterns. Analysts have noted, for example, the gapped pattern, where the progression of a Theme is interrupted by a clause or a collection of clauses (Dubois 1987), as well as the boxed pattern, where both elements of a Theme-Rheme pair are each linked to a corresponding element in the following pair (Leong 2005).

Incorporating Daneš’s TP in text-based studies raises at least two challenges. First, TP diagrams, particularly of extended texts, are difficult and time-consuming to produce. Until recently, text-based studies on TP have avoided including TP diagrams (Hawes 2015; Jalilifar 2010), due most likely to the immense effort required to produce such diagrams. Any mention of a TP pattern (e.g. simple-linear TP) tends to be restricted to specific portions of the text. While this is an understandable compromise, the text on the whole may itself take a particular thematic shape, which may not be visible without the help of a text-level diagram. Second, even if such TP diagrams are incorporated, an objective comparison of these diagrams cannot be easily achieved without quantification. This is not quite the same thing as counting the number of simple-linear or constant TPs in the text, as some studies have sought to do (e.g. Hawes 2015). This is because such numbers are restricted to identifiable TPs, but they may not capture the global pattern of the text in its entirety.

Leong (2015, 2016) and Leong et al. (2018) addressed these challenges by using simplified TP diagrams that included only topical Themes, rather than both topical Themes and Rhemes (as in Daneš’s original diagrams). The simplification avoided cluttering the diagrams by highlighting only the development of the topical Themes in the text. These diagrams were plotted using the Microsoft Excel program. Each row in the spreadsheet represented a clause, and each column, narrowed appropriately to resemble a small square, represented an idea expressed by the Theme of a particular clause. These ideas were termed ‘semantic labels’, and new labels were added as necessary following the criteria proposed by Martin (1992):

Identity chains are based on co-referentiality, which is realised through pronominal cohesion, instantial equivalence, the definite article and demonstratives (or lexical repetition if the reference is generic) … (419)

An example of a partial TP diagram, using the Introduction and Rationale sections of the extended abstract in (3) (see the Appendix), is given in Fig. 3. The unit of analysis in this case is the T-unit, short for “minimal terminable unit”, an expression first used by Hunt (1965). A T-unit comprises one independent clause, which may be accompanied by any dependent clause(s) associated with it. This unit of analysis has been used in several text-based studies (e.g. Jalilifar 2010; McCabe 1999; Williams 2009). Fries and Francis (1992) note that T-units allow for the TP of a text to be more easily discerned since “the structure of beta [dependent] clauses, including their thematic structure, tends to be constrained by the alpha [independent] clauses” (47). In (3), reference numbers prefixed by ‘R’ are used in place of clause-boundary markers so that each row in Fig. 3 can be easily matched with the unit of analysis in the text.

(3) R0001 Saturn’s main ring system is associated with a family of small moons.
 R0002 Pan and Daphnis orbit within the A-ring’s Encke Gap and Keeler Gap, respectively,
 R0003 whereas Pandora and Prometheus orbit just outside the F-ring
 R0004 and Atlas [orbits] just outside the A-ring.
 R0005 The latter three moons help to confine ring particles.
 R0006 The moons Janus and Epimetheus are in closely spaced orbits [[that they exchange approximately every 4 years;]]
 R0007 these two objects may be collisional fragments of a larger body.
 R0008 All these moons have densities much less than 1000 kg/m3, indicating [[that they formed from ring debris [[that accumulated around a preexisting core.]]]]
 R0009 During the final stages of the Cassini mission, the spacecraft made a series of close observations of Saturn’s rings.
 R0010 Flybys of Pan, Daphnis, Pandora, Atlas, and Epimetheus were performed to investigate the geologic processes [[shaping their surfaces, their composition, their thermal and ultraviolet properties, their relationship to Saturn’s ring system, and their interactions with particles in Saturn’s magnetosphere.]]

Fig. 3 Example of a TP diagram using the Microsoft Excel program

The choice of semantic labels depends on the judgment of the analyst and the text being analyzed. In the case of (3), since the moons are talked about collectively, I used only a single label ‘Moons’ to refer to all of Saturn’s moons. In a different text with a specific focus on a particular moon or two, the analyst must then come up with separate semantic labels to capture the distinction.

As regards the quantification of such simplified TP diagrams, Leong (2016) proposed a thematic-density index (TDI), which is obtained by dividing the number of analytical units (C) by the number of semantic labels for each text. The analytical unit can be an individual clause, a T-unit, or a sentence, depending on the nature of the study. The formula for the TDI is given in (4):

(4) TDI = \( \frac{\mathrm{Number}\ \mathrm{of}\ \mathrm{analytical}\ \mathrm{units}}{\mathrm{Number}\ \mathrm{of}\ \mathrm{semantic}\ \mathrm{labels}} \)

From the TP diagram in Fig. 3, the TDI can be easily computed since it is merely a matter of dividing the number of rows by the number of columns. The TDI thus gives us the average number of analytical units per semantic label. In the case of (3), this value is \( \frac{10}{3} \) or 3.33 (it should be noted that this is only the value of the Introduction and Rationale sections of the abstract). The possible range of TDI values is 1 to C. At one extreme end of the index, the TDI is 1, where the number of topical Themes (= C, since each analytical unit contains one topical Theme) equals the number of semantic labels; each topical Theme, in other words, corresponds to a different idea. At the other extreme end of the index, the value is C, where all the topical Themes correspond to only a single idea. These two extreme TDI values represent the two canonical thematic structures that a text can take at the macro level: a simple-linear TP (where TDI = 1) and a constant TP (where TDI = C). Since an extended text very rarely takes such extreme values, it is far more common to describe a text as exhibiting a general simple-linear TP structure or a general constant TP structure. For instance, in his work on scientific research articles, Leong (2015) observed that a typical text had a general simple-linear shape in the Introduction section, followed by an “anchored development” in the rest of the article (see Fig. 4).
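The computation can be illustrated with a minimal Python sketch, which is not part of the Excel template. The function name is hypothetical, and the label assignment is an assumption: only ‘Ring system’ and ‘Moons’ are named in this report, so the third label and its placement on R0009 and R0010 are guesses chosen to match the three labels reported for (3).

```python
def thematic_density_index(labels):
    """Return the TDI: analytical units divided by distinct semantic labels."""
    if not labels:
        raise ValueError("at least one analytical unit is required")
    return len(labels) / len(set(labels))


# One semantic label per T-unit of example (3), R0001 to R0010.
# ASSUMPTION: '03 Cassini mission' and its placement are illustrative only.
example_3 = (
    ["01 Ring system"]
    + ["02 Moons"] * 7
    + ["03 Cassini mission"] * 2
)

tdi = thematic_density_index(example_3)
print(f"TDI = {tdi:.2f}")  # TDI = 3.33
```

The two extremes follow directly: a list in which every label is different returns 1, and a list of C identical labels returns C.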

Fig. 4 Thematic structure of a scientific research article, illustrating a general simple-linear TP followed by a general constant TP (Leong 2015, 304)

The simplified TP diagrams and the TDI go some way to help us both view the thematic shape of the entire text and assign a numerical value to this shape, so that thematic diagrams can be statistically compared and tested (e.g. using analysis of variance). However, in the three cited studies that used the simplified TP diagrams, the plotting of the diagrams and the computing of the TDIs were still performed manually. Since then, I have re-formatted the Microsoft Excel program to automate both these processes. The technical details are described in the next section.

A (partially) automated tool

The tool takes the form of a formatted Microsoft Excel template. I should clarify that the template does not automate all the analytical processes; it automates only the construction of the diagram and the computation of the TDI. The text to be analyzed still needs to be manually prepared by the analyst. The program, for instance, cannot automatically divide up the text into pre-determined units of analysis since these units may vary in different studies. In large corpus-based studies, for instance, it may be more convenient to consider the T-unit or sentence, rather than the individual clause, as the basic unit of analysis.

The template is available for download, without charge, from the following web page. The user is also free to customize the template to meet her/his preferences.

The template contains two worksheets: Data and Diagram. The Data worksheet is where the analysis of the text is performed. The Diagram worksheet contains the TP diagram, which is updated automatically in response to any change in the Data worksheet. This allows the analyst to see, as it were, the plotting of the diagram in real time. The updating of the TP diagram involves a Visual Basic project, and for this reason, the entire template must be saved as a macro-enabled file (i.e. it should carry the ‘.xlsm’ suffix). For the same reason, a user who receives a security warning about macros having been disabled when opening the template should select ‘Enable Content’.

The use of the template is described in the sub-sections below. In the analysis, the extended abstract is used as the example text, and the T-unit is adopted as the unit of analysis.

Preparatory stage

Before using the template, the user should first prepare the text as follows:

(a) All formatting in the text should preferably be removed. This can be done by converting the document into a text-only file.

(b) The text is then divided up into appropriate units of analysis (e.g. a clause, T-unit, or sentence). Each unit of analysis should be on a separate line; there should be no blank line(s) separating one unit of analysis from the next. The units of analysis need not be numbered since the rows in the template already contain reference numbers. The text, properly prepared, should resemble Fig. 5.
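The mechanical part of this preparation (one unit per line, no blank lines) can be checked with a short Python sketch. This is an illustrative helper, not part of the template, and the function name is hypothetical. The division of the text into units of analysis remains a manual decision by the analyst; the code only tidies text that has already been divided.

```python
def prepare_units(raw_text):
    """Return one stripped unit of analysis per line, dropping blank lines."""
    return [line.strip() for line in raw_text.splitlines() if line.strip()]


# Text already divided into T-units by the analyst, with a stray blank line.
raw = """Saturn's main ring system is associated with a family of small moons.

Pan and Daphnis orbit within the A-ring's Encke Gap and Keeler Gap, respectively,
whereas Pandora and Prometheus orbit just outside the F-ring
"""

units = prepare_units(raw)
print("\n".join(units))
```

The resulting lines can then be copied into a text-only file for pasting into the template.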

Fig. 5 Prepared text for analysis; each unit of analysis is a separate line

Data worksheet

The Data worksheet has been preset to display 1000 rows for the units of analysis. This is headed by the label ‘Clauses’ in the template, but the user can amend it to a more suitable header depending on her/his study (e.g. ‘T-unit’, ‘Sentence’). These rows are marked out as ‘R0001’, ‘R0002’, ‘R0003’, etc. These reference numbers can of course be customized, but they should be accompanied by a letter. This is to prevent the program from mistakenly regarding them as numerical values, which they are not.
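For users who wish to customize the reference numbers, the following Python sketch generates labels in the same style. The function and its parameters are hypothetical; the check on the prefix simply mirrors the advice that each reference number should carry a letter.

```python
def reference_numbers(count, prefix="R", width=4):
    """Return labels such as 'R0001', 'R0002', ... for `count` rows."""
    if not prefix or not prefix[0].isalpha():
        # A leading letter keeps the label from being read as a number.
        raise ValueError("prefix should start with a letter")
    return [f"{prefix}{i:0{width}d}" for i in range(1, count + 1)]


refs = reference_numbers(3)
print(refs)  # ['R0001', 'R0002', 'R0003']
```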

The second column—Text—is where the prepared text is inserted. This is done by copying and pasting the prepared text into the column (see Fig. 6).

Fig. 6 Data worksheet with example text

The third column—Semantic Labels—contains the labels assigned to the topical Themes. The user may come up with any label, as appropriate, to reflect the semantic content of each topical Theme. The following, however, must be observed:

(a) A number must be inserted in front of each semantic label. This is obligatory for the template to order the semantic labels sequentially; the numbers are critical to prevent sorting errors. If the semantic labels are not preceded by a number, the program will sort the semantic labels alphabetically instead, resulting in an inaccurate TP diagram. Hence, the first semantic label should be preceded by ‘01’ (e.g. ‘01 Ring system’).

(b) A different number should be assigned to only new labels. For example, if another topical Theme corresponds to the same idea of Saturn’s ring system, the semantic label ‘01 Ring system’ is reused. But if the topical Theme corresponds to a new idea (e.g. Saturn’s moons), then a new semantic label should be created (e.g. ‘02 Moons’).
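The effect of the numeric prefix can be seen in a small Python sketch. It uses Python’s plain string sort as a stand-in for alphabetical sorting and makes no claim about how the template sorts internally. ‘Cassini mission’ is a hypothetical third label used only for illustration.

```python
# Labels in the order in which they were introduced during the analysis.
# ASSUMPTION: 'Cassini mission' is an illustrative label, not from the report.
introduced = ["Ring system", "Moons", "Cassini mission"]

# Without a numeric prefix, an alphabetical sort loses the order of introduction.
print(sorted(introduced))
# ['Cassini mission', 'Moons', 'Ring system']

# With a zero-padded prefix, the same sort preserves the order of introduction.
numbered = [f"{i:02d} {name}" for i, name in enumerate(introduced, start=1)]
print(sorted(numbered))
# ['01 Ring system', '02 Moons', '03 Cassini mission']

# The leading zero matters once there are ten or more labels:
# as text, an unpadded '2' sorts after '10'.
print(sorted(["2 B", "10 J"]))
# ['10 J', '2 B']
```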

The user can keep track of the semantic labels by consulting the table titled Themes per Semantic Label (on the right of the Data worksheet). This table is automatically updated whenever a new label is created. As the semantic labels in this table are automatically sorted, it also helpfully informs the user about the new number to be assigned to a new semantic label.
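The same bookkeeping can be sketched in Python: given the labels assigned so far, the helper below returns the number to use for the next new label. The function name is hypothetical, and it assumes every label begins with a number followed by a space, as required above.

```python
def next_label_prefix(labels, width=2):
    """Return the zero-padded number to use for the next new semantic label."""
    used = {int(label.split(maxsplit=1)[0]) for label in labels if label.strip()}
    next_number = max(used, default=0) + 1
    return f"{next_number:0{width}d}"


assigned = ["01 Ring system", "02 Moons", "02 Moons", "02 Moons"]
print(next_label_prefix(assigned))  # 03
```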

The fourth column—Frequency—is automatically populated whenever a semantic label is inserted. No action is required from the user for this column.

At the end of the analysis, the user will notice a row in the Themes per Semantic Label table marked ‘(blank)’ (see highlighted row in Fig. 7). This is because the template has 1000 preset rows, of which only 30 have been filled with the example text and semantic labels. We therefore need to delete all the unused rows.

Fig. 7 ‘Blank’ row in the ‘Themes per Semantic Label’ table

Deleting the rows is not simply a matter of removing the entries in the first column from ‘R0031’ onwards. Instead, the unused rows should be selected in their entirety, followed by ‘Delete’ > ‘Entire row’ (see Fig. 8).

Fig. 8 Deleting unused rows by selecting (a) ‘Delete’, followed by (b) ‘Entire Row’

Once the unused rows have been deleted, the analysis is now complete. Summary statistics are provided on the right of the Data worksheet. They reveal the following:

(a) TDI: this is calculated by dividing the number of Themes (= number of rows) by the number of semantic labels used.

(b) Themes per semantic label: the numbers are also expressed as percentage figures (of the total number of Themes).
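The second of these summary statistics can be reproduced outside the template with a short Python sketch. The function name is hypothetical, and the label list is an assumed assignment for the ten T-units of example (3); only ‘Ring system’ and ‘Moons’ are named in this report.

```python
from collections import Counter


def themes_per_label(labels):
    """Return (label, count, percentage) rows, sorted by label."""
    counts = Counter(labels)
    total = len(labels)
    return [(label, n, 100 * n / total) for label, n in sorted(counts.items())]


# ASSUMPTION: illustrative labels; '03 Cassini mission' is not from the report.
labels = ["01 Ring system"] + ["02 Moons"] * 7 + ["03 Cassini mission"] * 2

rows = themes_per_label(labels)
for label, n, pct in rows:
    print(f"{label:<20} {n:>3} {pct:6.1f}%")
```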

In summary, the only two columns in which action is required from the user are the Text and Semantic Labels columns. There is no need to change anything elsewhere (apart from deleting the unused rows at the end of the analysis).

Diagram worksheet

The Diagram worksheet captures the TP diagram of the analyzed text. The diagram is automatically refreshed to reflect all the semantic labels, appropriately sorted, in the Data worksheet. The TP diagram of the extended abstract is given in Fig. 9.

Fig. 9 TP diagram of the extended abstract

The user can also view the entire diagram on a single screen by clicking the ‘View TP Diagram’ button at the top left corner of the worksheet. As we can see from Fig. 9, an anchored development, centering on ‘Moons’, is rather visible. While there are developments into other areas such as reflectance spectra or the effects of materials, the description remains firmly grounded on the moons.
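The row-by-column layout of such a diagram can be approximated in plain text with the Python sketch below. It is only an illustration of the layout (one row per unit of analysis, one column per sorted semantic label) and does not reproduce the Visual Basic code in the template. The function name and the label assignment for the ten T-units of example (3) are assumptions.

```python
def tp_diagram(labels, filled="#", empty="."):
    """Return (columns, rows): one text row per unit, one cell per label."""
    columns = sorted(set(labels))
    rows = []
    for i, label in enumerate(labels, start=1):
        cells = "".join(filled if label == col else empty for col in columns)
        rows.append(f"R{i:04d} {cells}")
    return columns, rows


# ASSUMPTION: illustrative labels; '03 Cassini mission' is not from the report.
labels = ["01 Ring system"] + ["02 Moons"] * 7 + ["03 Cassini mission"] * 2

columns, rows = tp_diagram(labels)
print("\n".join(rows))
```

A long vertical run of filled cells in one column corresponds to the anchored development described above.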


The template permits the analyst to both visualize and quantify the thematic structure of any extended text. As comparing diagrams alone carries the risk of subjective bias, the template computes the TDI for each diagram to allow different diagrams to be statistically compared. In a corpus-based study, the use of the template can be useful to investigate whether a particular text type displays a common structure. For instance, the studies by Leong (2015) and Leong et al. (2018) have shown that scientific research articles appear to display a common thematic shape—a general simple-linear TP in the Introduction section, followed by an anchored development in the subsequent sections (see Fig. 4). Indeed, as Leong et al. (2018) suggest, this common shape reflects the fundamental two-stage approach to scientific research writing: (1) through the use of the simple-linear TP in the Introduction section, scientists narrow the description towards their research focus, and (2) through the use of the constant TP in the rest of the article, they focus on what they found.
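As a rough illustration of how TDIs from several groups of texts might be compared, the Python sketch below computes the F statistic of a one-way analysis of variance by hand. The function name is hypothetical and the numbers are made-up placeholders, not TDIs from any study; the sketch stops at the F statistic and does not compute a p-value.

```python
def one_way_anova_f(groups):
    """Return the F statistic of a one-way ANOVA over lists of values."""
    values = [v for g in groups for v in g]
    grand_mean = sum(values) / len(values)
    means = [sum(g) / len(g) for g in groups]
    # Variation between the group means, weighted by group size.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of each value around its own group mean.
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# PLACEHOLDER values standing in for the TDIs of three groups of texts.
toy_tdis = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [3.0, 4.0, 5.0]]
print(one_way_anova_f(toy_tdis))  # 3.0
```

In practice, a statistics package would normally be used so that the corresponding p-value is reported as well.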

The template can be customized and used flexibly. The Hallidayan account, for instance, regards unmarked non-finite clauses in English as being Themeless. These can be easily accommodated in the template by specifying, say, ‘00 Themeless’ as a semantic label so that such Themeless blots appear at the left edge of the TP diagram. They can be ignored when considering the overall thematic shape of the text or, if they are pervasive enough, a comment can also be made about how they affect the thematic development of the larger text. Also, while this report has centered on the Hallidayan framework, I am mindful that there are other conceptions of Theme (e.g. Berry 2013), and that the delimitation of Theme in other languages can be rather different from how it is done in English (e.g. Moyano 2016). Such differences do not affect the applicability of the template. It is merely a tool to allow the user to track the semantic idea in each Theme as part of a larger diagram.

The one obvious drawback of the template, however, is its rather large file size. The blank template itself is 9.47 MB in size, and a less-powerful computer system may encounter some sluggishness when using the template. This is not a problem with a ready solution since the template is based on the Microsoft Excel program. It is hoped that in future versions of the program, this problem can be minimized.

One might also argue that the simplified TP diagrams do not quite meet the original intent of Daneš’s diagrams, which include Rhemes. I might point out, however, that including Rhemes in the diagrams would complicate the thematic shape unnecessarily. This is because a Rheme, by definition, is the remainder of the clause beyond the topical Theme. Depending on the clause in question, the rhematic segment can get lengthy and so requires more than one semantic label to capture its content adequately. This complexity is worsened if T-units are chosen as the basic unit of analysis. Since the T-unit may, and often does, include subordinate clauses, the portion rendered as ‘Rheme’ will be even lengthier, requiring even more semantic labels. The TP diagrams proposed here are called simplified for precisely this reason; they seek to chart the development and progression of only Themes in the text. More crucially, what I hope to have shown in this report is that a ‘picture’ of the text (with an accompanying numerical value) can be produced easily with nothing more than a thematic analysis typically undertaken in text-based studies. It is further hoped that this can be of some value for education and research purposes.

Availability of data and materials

All data generated or analyzed during this study are included in this published article [and its supplementary information files].




Not applicable.


No funding was received.

Author information

Authors and Affiliations



The author prepared, read, and approved the final manuscript.

Corresponding author

Correspondence to Ping Alvin Leong.

Ethics declarations

Competing interests

The author declares that he has no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.



Close Cassini flybys of Saturn’s ring moons Pan, Daphnis, Atlas, Pandora, and Epimetheus

Extended Abstract


Saturn’s main ring system is associated with a family of small moons. Pan and Daphnis orbit within the A-ring’s Encke Gap and Keeler Gap, respectively, whereas Pandora and Prometheus orbit just outside the F-ring and Atlas just outside the A-ring. The latter three moons help to confine ring particles.

The moons Janus and Epimetheus are in closely spaced orbits that they exchange approximately every 4 years; these two objects may be collisional fragments of a larger body. All these moons have densities much less than 1000 kg/m3, indicating that they formed from ring debris that accumulated around a preexisting core.


During the final stages of the Cassini mission, the spacecraft made a series of close observations of Saturn’s rings. Flybys of Pan, Daphnis, Pandora, Atlas, and Epimetheus were performed to investigate the geologic processes shaping their surfaces, their composition, their thermal and ultraviolet properties, their relationship to Saturn’s ring system, and their interactions with particles in Saturn’s magnetosphere.


The moons that orbit in ring gaps or are adjacent to the main rings have equatorial ridges of material consisting of accreted particles that are distinct from their rounded central cores. The cores are more structurally sound than ridges, with rougher surfaces and more impact craters. Complex patterns of grooves formed by tidal stresses crisscross the moons.

A visible-infrared reflectance spectrum of Pan, which is embedded in the rings, shows that it is redder than any of the other ring moons. The color of the moons becomes more red as the distance to Enceladus increases. This suggests that the optical properties of the moons are determined by the balance of two external effects: addition of a red coloring agent from the main rings, and accretion of neutral-colored icy particles or water vapor possibly from the E-ring, which is formed by Enceladus’s plume. The exact composition of the red material from the main ring system is unknown, although a mixture containing organic silicates and iron is likely. Differences in particle size also affect the moons’ spectra.

Measurements of the spectral slope of Epimetheus in the ultraviolet suggest that it is less affected by particles from the E-ring than are the mid-sized main moons farther from Saturn. Temperature maps were derived for both Atlas and Epimetheus, whose blackbody temperatures were 82 ± 5 K and 90 ± 3 K, respectively.

Carbon dioxide, which is present on the eight mid-sized saturnian moons, was not detected on the ring moons, nor were any volatiles other than water ice. Measurements showed a scarcity of high-energy ions in the vicinity of the ring moons and only transient energetic electron populations; in the ring gaps, no trapped electron or proton radiation was detected. Although particle bombardment alters both the albedo and color of the main moons’ surfaces, for the ring moons it appears to be unimportant.


Saturn’s ring moons record a complex geologic history with groove formation caused by tidal stresses and accretion of ring particles. The moons embedded within the rings or near their edges have solid cores with equatorial ridges of more weakly consolidated material. The finding of a porous surface further supports substantial accretion. High-resolution images strongly suggest exposures of a solid substrate distinct from the mobile regolith that frequently covers essentially all small Solar System objects. These exposures may eventually help to reveal systematic trends of the evolution of moons and their geologic structures for the whole of Saturn’s satellite system.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.


About this article


Cite this article

Leong, P.A. Visualizing texts: a tool for generating thematic-progression diagrams. Functional Linguist. 6, 4 (2019).
