4  Research

Outcomes

  • Identify a research area and problem by listing key strategies and describing their contribution towards research identification.
  • Explain the significance of a well-framed research question in guiding the overall research project.
  • Comprehend how the conceptual and practical steps involved in developing a research blueprint aid not only the researcher but also the broader scientific community.

In this chapter, we discuss how to frame research, that is, how to position your research project’s findings to contribute insight to our understanding of the world. We will cover how to connect with the literature by selecting a research area and identifying a research problem, and how to design research that is best positioned to return relevant findings that connect with this literature by establishing a research aim and research question. We will round out this chapter with a guide to developing a research blueprint: a working plan that organizes the conceptual and practical steps needed to implement the research effectively and in a way that supports communicating the research findings and the process by which they were obtained.

Lessons

What: Project Environment
How: In an R console, load {swirl}, run swirl(), and follow prompts to select the lesson.
Why: To highlight the importance of the computing environment in R for project management and reproducibility.

4.1 Frame

Together, a research area, problem, aim, and question, along with the research blueprint that forms the conceptual and practical scaffolding of the project, ensure from the outset that the project is solidly grounded in the main characteristics of good research. These characteristics, summarized by Cross (2006), are found in Table 4.1.

Table 4.1: Characteristics of good research (Cross, 2006)
Characteristic | Description
Purposive | Based on identification of an issue or problem worthy and capable of investigation
Inquisitive | Seeking to acquire new knowledge
Informed | Conducted from an awareness of previous, related research
Methodical | Planned and carried out in a disciplined manner
Communicable | Generating and reporting results which are feasible and accessible by others

With these characteristics in mind, let’s get started with the first component to address: connecting with the literature.

4.2 Connect

Research area

The first decision to make in the research process is to identify a research area. A research area is a general area of interest where a researcher wants to derive insight and make a contribution to understanding. For those with an established research trajectory in language, the area of research to address through text analysis will likely be an extension of their prior work. For others, including new researchers and researchers who want to explore new areas of language research or approach an area through a language-based lens, the choice of area may be less obvious. In either case, the choice of a research area should be guided by a desire to contribute something relevant to a theoretical, applied, and/or practical matter of personal interest. Personal relevance goes a long way toward developing and carrying out purposive and inquisitive research.

So how do we get started? Consider your interests in a language or set of languages, a discipline, a methodology, or some applied area. Language is at the heart of the human experience and can therefore be found in some fashion anywhere one seeks to find it. But it is a big world, and the general question of which area of language use to explore is often the most difficult. To get the ball rolling, it is helpful to peruse disciplinary encyclopedias or handbooks of linguistics and language-related academic fields (e.g. Encyclopedia of Language and Linguistics (Brown, 2005), A Practical Guide to Electronic Resources in the Humanities (Dubnjakovic & Tomlin, 2010), Routledge Encyclopedia of Translation Technology (Chan, 2014)).

A more personal, less academic approach is to consult online forums, blogs, and similar resources that you already frequent or can find through an online search. Through social media you may find particular people who maintain a blog worth browsing. Perusing these resources can help spark ideas and highlight the kinds of questions that interest you.

Regardless of whether your inquiry stems from academic, professional, or personal interest, try to connect these findings to academic areas of research. Academic research is highly structured and well-documented, and making associations with this network will aid in subsequent steps in developing a research project.

Research problem

Once you’ve made a rough-cut decision about the area of research, it is now time to take a deeper dive into the subject area and jump into the literature. This is where the rich structure of disciplinary research will provide aid to traverse the vast world of academic knowledge and identify a research problem. A research problem highlights a particular topic of debate or uncertainty in existing knowledge which is worthy of study.

Surveying the relevant literature is key to ensuring that your research is informed, that is, connected to previous work. Identifying relevant research to consult can be a bit of a ‘chicken or the egg’ problem: some knowledge of the area is necessary to find relevant topics, and some knowledge of the topics is necessary to narrow the area of research. Many times the only way forward is to jump into conducting searches. These can be world-accessible resources (e.g. Google Scholar) or limited-access resources that are provided through an academic institution (e.g. Linguistics and Language Behavior Abstracts, ERIC, PsycINFO, etc.). Some organizations and academic institutions provide research guides to help researchers access the primary literature. There is even a new breed of search engines designed to help researchers aggregate and search academic literature (e.g. Scite, Elicit, etc.). Another avenue to explore is the set of journals and conference proceedings dedicated to linguistics and language-related research. Text analysis is a rapidly expanding methodology which is being applied to a wide range of research areas.

To explore research related to text analysis, it is helpful to start with the (sub)discipline name(s) you identified when selecting your research area, more specific terms that occur to you or key terms from the literature, and terms such as ‘corpus study’ or ‘corpus-based’. The results from first searches may not turn out to be sources that end up figuring explicitly in your research, but it is important to skim these results and the publications themselves to mine information that can be useful to formulate better and more targeted searches.

Relevant information for honing your searches can be found throughout an academic publication. However, pay particular attention to the abstract (in articles), the table of contents (in books), and the cited references. Abstracts and tables of contents often include discipline-specific jargon that is commonly used in the field. In some articles, there is even a short list of key terms below the abstract, which can be extremely useful to seed better and more precise search results. The references section will contain relevant and influential research. Scan these references for publications which appear to narrow in on your topic of interest and treat each one like a search in its own right.

Once your searches begin to show promising results, it is time to keep track of and organize these references. Whether you plan to collect thousands of references over a lifetime of academic research or your aim is centered around one project, software such as Zotero, Mendeley, or BibDesk provides powerful, flexible, and easy-to-use tools to collect, organize, annotate, search, and export references. Citation management software is indispensable for modern research, and often free!

As your list of relevant references grows, you will want to start the investigation process in earnest. Begin skimming (not reading) the contents of each of these publications, starting with what appears to be the most relevant first. Annotate these publications using highlighting features of the citation management software to identify: (1) the stated goal(s) of the research, (2) the data source(s) used, (3) the information drawn from the data source(s), (4) the analysis approach employed, and (5) the main finding(s) of the research as they pertain to the stated goal(s).

Next, in your own words, summarize these five key areas in prose adding your summary to the notes feature of the citation management software. This process will allow you to efficiently gather and document references with the relevant information to guide the identification of a research problem, to guide the formation of your problem statement, and ultimately, to support the literature review that will figure in your project write-up.

From your preliminary annotated summaries you will undoubtedly start to recognize overlapping and contrasting aspects in the research literature. These aspects may be topical, theoretical, methodological, or appear along other lines. Note these aspects and continue to conduct more refined searches, annotate new references, and monitor for any emerging uncertainties, limitations, debates, and/or contradictions which align with your research interest(s). When a promising pattern takes shape, it is time to engage in a more detailed reading of those references which appear most relevant, highlighting the potential gap(s) in the literature.

At this point you can focus energy on more nuanced aspects of a particular gap in the literature with the goal to formulate a problem statement. A problem statement directly acknowledges a gap in the literature and puts a finer point on the nature and relevance of this gap for understanding. This statement reflects your first deliberate attempt to establish a line of inquiry. It will be a targeted, but still somewhat general, statement framing the gap in the literature that will guide subsequent research design decisions.

4.3 Define

Research aim

With a problem statement in hand, it is now time to consider the goal(s) of the research. A research aim frames the type of inquiry to be conducted. Will the research aim to explore, predict, or explain? As you can appreciate, the research aim is directly related to the analysis methods we touched upon in Chapter 3.

To gauge how to frame your research aim, reflect on the literature that led you to your problem statement and the nature of the problem statement itself. If the gap at the center of the problem statement is a lack of knowledge, your research aim may be exploratory. If the gap concerns a conjecture about a relationship, then your research may take a predictive approach. When the gap points to the validation of a relationship, then your research will likely be inferential in nature. Before selecting your research aim it is also helpful to consult the research aims of the primary literature that led you to your problem statement.

Typically, a problem statement addressing a subtle, specific issue tends to adopt research objectives similar to prior studies. In contrast, a statement focusing on a broader, more distinct issue is likely to have unique research goals. Yet, this is more of a guideline than a strict rule.

It’s crucial to understand both the existing literature and the nature of various types of analyses. Being clear about your research goals is important to ensure that your study is well-placed to produce results that add value to the current understanding in an informed manner.

Research question

The next step in research design is to craft the research question. A research question is a clearly defined statement which identifies an aspect of uncertainty and the particular relationships that this uncertainty concerns. The research question extends and narrows the line of inquiry established in the problem statement and research aim. To craft a research question, we can use the problem statement for the content and the research aim for the form.

Form

The form of a research question will vary based on the research aim, which, as I mentioned, is intimately connected to the analysis approach. For inferential-based research, the research question will actually be a statement, not a question. This statement makes a testable claim about the nature of a particular relationship, i.e. it asserts a hypothesis.

For illustration, let’s posit a hypothesis (\(H_1\)), leaving aside the implicit null hypothesis (\(H_0\)), seen in Example 4.1.

Example 4.1 Women use more questions than men in spontaneous conversations.

For predictive- and exploratory-based research, the research question is in fact a question. A reframing of the example hypothesis for a predictive-based research question might take the form seen in Example 4.2.

Example 4.2 Can the number of questions used in spontaneous conversations predict if a speaker is male or female?

And a similar exploratory-based research question might take the form seen in Example 4.3.

Example 4.3 Do men and women differ in terms of the number of questions they use in spontaneous conversations?

The central research interest behind these hypothetical research questions is, admittedly, quite basic. But from these simplified examples, we are able to appreciate the similarities and differences between the forms of research statements that correspond to distinct research aims.

Content

In terms of content, the research question will make reference to two key components. The first is the unit of analysis. The unit of analysis is the entity which the research aims to investigate. For our three example research aims, the unit of analysis is the same, namely speakers. Note, however, that the current unit of analysis is somewhat vague in the example research questions. A more precise unit of analysis would include more information about the population from which the speakers are drawn (e.g. English speakers, American English speakers, American English speakers of the Southeast, etc.).

The second key component is the unit of observation. The unit of observation is the primary element from which insight into the unit of analysis is derived and in this way constitutes the essential organizational unit of the dataset to be analyzed. In our examples, the unit of observation, again, is unchanged and is spontaneous conversations. Note that while the unit of observation is key to identify, as it forms the organizational backbone of the research, it is very common for the research to derive variables from this unit to provide evidence to investigate the research question.

In Examples 4.1, 4.2, and 4.3, we identified the number of questions as part of the research question. Later in the research process it will be key to operationalize this variable. For example, will the number of questions be the total number of questions in the dataset or will it be the average number of questions per speaker? These are important questions to consider as they will influence variable selection, statistical choices, and ultimately the interpretation of the results. Operationalizing the variables is a key part of the research design. Without inclusion and exclusion criteria, the research question is not well-defined and the meaningfulness of the results will be obscured (Larsson & Biber, 2024).
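
To make the idea of operationalization concrete, here is a minimal R sketch of two ways one might count the questions referred to in Examples 4.1 to 4.3. The data frame, its column names, and the rule that an utterance is a question if it ends in a question mark are all assumptions made for illustration; a real project would need a more careful definition.

```r
# Hypothetical toy data: one row per utterance in a spontaneous conversation.
# All names and values are invented for illustration only.
utterances <- data.frame(
  conversation_id = c(1, 1, 1, 2, 2, 2),
  speaker_id = c("A", "B", "A", "A", "C", "C"),
  utterance = c(
    "Did you see that?", "I did.", "What did you think?",
    "It was fine.", "Really?", "I liked it."
  ),
  stringsAsFactors = FALSE
)

# Assumed (crude) operationalization: an utterance counts as a question
# if it ends in a question mark.
utterances$is_question <- grepl("\\?$", utterances$utterance)

# Option 1: total number of questions in the dataset
total_questions <- sum(utterances$is_question)

# Option 2: number of questions per speaker, then the average across speakers
questions_by_speaker <- tapply(utterances$is_question, utterances$speaker_id, sum)
mean_questions_per_speaker <- mean(questions_by_speaker)

total_questions
questions_by_speaker
mean_questions_per_speaker
```

The two options answer different questions about the same data: the first describes the dataset as a whole, while the second takes the speaker (the unit of analysis) as the basis for comparison.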

4.4 Blueprint

The efforts to develop a research question will produce a clear and focused line of inquiry with the necessary background literature and a well-defined problem statement that forms the basis of purposive, inquisitive, and informed research (returning to Cross’s characteristics of research in Table 4.1).

Moving beyond the research question in the project means developing and laying out the research design in a way such that the research is methodical and communicable. In this textbook, the method to achieve these goals is through the development of a research blueprint. The blueprint includes two components: (1) the conceptual plan and (2) the organizational scaffolding that will support the implementation of the research (Ignatow & Mihalcea, 2017).

In what follows, I will cover the main aspects of developing a research blueprint. I will start with the conceptual plan and then move on to the organizational scaffolding.

Plan

The importance of establishing a feasible research design from the outset and documenting the key aspects required to conduct the research cannot be overstated. On the one hand, this process links a conceptual plan to a tangible implementation. In doing so, a researcher is better positioned to conduct research with a clear view of what will be entailed. On the other hand, a promising research question may present unexpected challenges once a researcher sets about implementing the research. It is not uncommon to encounter issues that require modification or reevaluation of the viability of the project. However, a well-documented research plan will help a researcher to identify and address many of these challenges at the conceptual level before expending unnecessary effort during implementation.

Let’s now consider the subsequent steps to develop a research plan, outlined in Table 4.2.

Table 4.2: Research plan checklist
Step | Stage | Activity
1 | Research Question or Hypothesis | Formulate a research question or hypothesis based on a thorough review of existing literature including references. This will guide every subsequent step from data selection to interpretation of results.
2 | Data Source(s) | Identify viable data source(s) and vet the sample data in light of the research question. Consider to what extent the goal is to generalize findings to a target population, and ensure that the corpus aligns as much as feasible with this target.
3 | Key Variables | Determine the key variables needed for the research, define how they will be operationalized, and ensure they can be derived from the corpus data. Additionally, identify any features that need to be extracted, recoded, generated, or integrated from other data sources.
4 | Analysis Method | Choose an appropriate method of analysis to interrogate the dataset. This choice should be in line with your research aim (e.g., exploratory, predictive, or inferential). Be aware of what each method can offer and how it addresses your research question.
5 | Interpretation & Evaluation | Establish criteria to interpret and evaluate the results. This will be a function of the relationship between the research question and the analysis method.

First, identify a viable data source. Viability concerns the accessibility, availability, and content of the data. If a purported data source is not accessible and/or it has stringent restrictions on its use, then it is not a viable data source. If a data source is accessible and available, but does not contain the building blocks needed to address the research question, then it is not a viable data source. A corpus resource’s sampling frame should align, to the extent feasible, with the target population(s).

The second step is to identify the key variables needed to conduct the research and then ensure that this information can be derived from the corpus data. The research question will reference the unit of analysis and the unit of observation, but it is important to pinpoint what the key variables will be. We want to envision what needs to be done to derive these variables. There may be features that need to be extracted, recoded, generated, and/or integrated from other sources to address the research question, as discussed in Chapter 2.

The third step is to identify a method of analysis to interrogate the dataset. The analysis approach selected as part of the research aim (i.e. explore, predict, or explain), together with the research question, goes a long way toward narrowing the methods that a researcher must consider. But there are a number of factors which will make some methods more appropriate than others.

Exploratory research is the least restricted of the three types of analysis approaches. Although a researcher may not be able to specify the exact analysis methods from the outset of a project, an attempt to consider which types of analysis methods are most promising for addressing the research question goes a long way toward steering a project in the right direction and grounding the research. As with the other analysis approaches, it is important to be aware of what analysis methods are available and what type of information they produce in light of the research question.

For predictive-based research, the informational value of the outcome variable is key to deciding whether the prediction will be a classification task or a regression task. This has downstream effects when it comes time to evaluate and interpret the results. Although the feature engineering process in predictive analyses means that the features do not need to be specified from the outset and can be tweaked and changed as needed during an analysis, it is a good idea to start with a basic sense of what features most likely will be helpful in developing a robust predictive model.
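
As a rough illustration of how the informational value of the outcome variable steers this decision, the following R sketch maps an outcome vector to a task type. The function `suggest_task()` is a hypothetical helper written for this example, not part of any package, and it deliberately ignores harder cases such as ordinal outcomes, which call for judgment.

```r
# Hypothetical helper (not from any package): suggest the prediction task
# from the informational value of the outcome variable.
suggest_task <- function(outcome) {
  if (is.factor(outcome) || is.character(outcome) || is.logical(outcome)) {
    # Categorical outcome: predict a class label
    "classification"
  } else if (is.numeric(outcome)) {
    # Continuous (or count) outcome: predict a numeric value
    "regression"
  } else {
    stop("Unsupported outcome type: ", class(outcome)[1])
  }
}

# Invented example outcomes
suggest_task(factor(c("female", "male", "female")))
suggest_task(c(3.2, 1.5, 4.8))
```

In Example 4.2 the outcome is whether a speaker is male or female, a categorical variable, so a sketch like this would point toward a classification task.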

In inferential research, the number and informational values of the variables to be analyzed will be of key importance (Gries, 2013). The informational value of the response variable will again narrow the search for the appropriate method and statistical test to employ. The number of explanatory variables also plays an important role. All details need not be nailed down at this point, but it is helpful to have them on your radar to ensure that when the time comes to analyze the data, the appropriate steps are followed.

The last of the main components of the research plan concerns the interpretation and evaluation of the results. This step brings the research plan full circle, connecting the research question to the methods employed. It is important to establish from the outset what the criteria will be to evaluate the results. This is in large part a function of the relationship between the research question and the analysis method. For example, in exploratory research, the results will be evaluated qualitatively in terms of the associative patterns that emerge. Predictive and inferential research lean more heavily on quantitative metrics, in particular the accuracy of the prediction or the strength of the relationship between the response and explanatory variable(s), respectively. However, these quantitative metrics require qualitative interpretation to determine whether the results are meaningful in light of the research question.
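
To illustrate why a quantitative metric still needs interpretation, consider this small R sketch for a predictive setting. The labels are invented, and comparing accuracy against a majority-class baseline is one common sanity check that I assume here, not the only possible evaluation criterion.

```r
# Hypothetical actual and predicted labels for five speakers
actual    <- c("female", "male", "female", "male", "female")
predicted <- c("female", "male", "male",   "male", "female")

# Accuracy: the proportion of predictions that match the actual labels
accuracy <- mean(actual == predicted)

# Majority-class baseline: the accuracy obtained by always guessing
# the most frequent label in the data
baseline <- max(table(actual)) / length(actual)

# Cross-tabulate to see where the errors fall
confusion <- table(actual = actual, predicted = predicted)

accuracy
baseline
confusion
```

Whether the gap between the model's accuracy and the baseline is large enough to matter is a judgment made in light of the research question, which is why the evaluation criteria are worth writing down in the plan.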

In addition to addressing the steps outlined in Table 4.2, it is also important to document the strengths and shortcomings of the research plan, including the data source(s), the information to be extracted from the data, and the analysis methods. If there are potential shortcomings, which there most often are, sketch out contingency plans to address them. This will help buttress your research and ensure that your time and effort are well spent.

Dive deeper

You may consider pre-registering your prospectus to ensure that your plans are well-documented and to provide a timestamp for your research. Pre-registration can also be a helpful way to get feedback on your research from colleagues and experts in the field. A popular pre-registration platform is the Open Science Framework, which is maintained by the Center for Open Science.

The research plan together with the information collected to develop the research question is known as a prospectus. A prospectus is a document that outlines the key aspects of the research plan and is used to guide the research process. It is a living document that will be updated as the research progresses and as new information is collected.

Scaffold

The next step in developing a research blueprint is to consider how to physically implement your project. This includes organizing files and directories in a way that provides the researcher with a logical and predictable structure to work with. As the research progresses, the structure will house the data, code, and output of the research as well as the documentation of the research process, together known as a research compendium. In addition to a strong write-up of the research, a research compendium ensures that the research is communicable.

Communicable research is reproducible research. Reproducibility strategies benefit the researcher (in the moment and in the future) as they lead to better work habits and better teamwork, and they make changes to the project easier. Reproducibility also benefits the scientific community, as shared reproducible research enhances replicability and encourages cumulative knowledge development (Gandrud, 2015).

In Table 4.3, I outline a set of guiding principles that characterize reproducible research (Gentleman & Temple Lang, 2007; Marwick, Boettiger, & Mullen, 2018).

Table 4.3: Reproducible research principles
No. | Principle | Description
1 | Plain text | All files should be plain text which means they contain no formatting information other than whitespace.
2 | Clear separation | There should be a clear separation between the inputs, process steps, and outputs of research. This should be apparent from the directory structure.
3 | Original data | A separation between original data and data created as part of the research process should be made. Original data should be treated as ‘read-only’. Any changes to the original data should be justified, generated by the code, and documented (see point 7).
4 | Modular scripts | Each computing file (script) should represent a particular, well-defined step in the research process.
5 | Modular files | Each script should be modular, that is, each file should correspond to a specific goal in the analysis procedure with input and output only corresponding to this step.
6 | Main script | The project should be tied together by a ‘main’ script that is used to coordinate the execution of all the project steps.
7 | Document everything | Everything should be documented. This includes data collection, data preprocessing, processing steps, script code comments, data description in data dictionaries, information about the computing environment and packages used to conduct the analysis, and detailed instructions on how to reproduce the research.

These seven principles in Table 4.3 can be physically implemented in numerous ways. In recent years, there has been a growing number of efforts to create R packages and templates to quickly generate the scaffolding and tools to facilitate reproducible research. Some notable R packages include workflowr (Blischak, Carbonetto, & Stephens, 2019), ProjectTemplate (White, 2023), and targets (Landau, 2021), but there are many other resources for R included on the CRAN Task View for Reproducible Research.

There are many advantages to working with pre-existing frameworks for the savvy R programmer including the ability to quickly generate a project scaffold, to efficiently manage changes to the project, and to buy in to a common framework that is supported by a community of developers.

On the other hand, these frameworks can be a bit daunting for the novice R programmer. At the most basic level, a project can implement the seven principles outlined above with a directory structure and a set of key files seen in Snippet 4.1.

Snippet 4.1 Minimal Project Framework

project/
├── input/
│   └── ...
├── code/
│   └── ...
├── output/
│   └── ...
├── DESCRIPTION
├── Makefile
└── README

The project/ directory is composed of three main sections: input/, code/, and output/, making the distinction between each transparent in the directory structure. The input/ section will house the data used and created in the project, ensuring that the original data is kept separate from the data created in the research process. The code/ section will house the scripts that will conduct the project steps including acquiring, curating, transforming, and analyzing the data. These scripts will read and write data and generate output including figures, reports, results, and tables. Lastly, the output/ section will house the resulting output from the project steps.

At the root of the project directory are three files which describe, document, and execute the project. The Makefile is used to automate the execution of the project steps. In effect, it is a script that runs scripts. In addition to coordinating the execution of the project steps, a Makefile will often include commands to set up the computing environment and packages. The README and DESCRIPTION files provide an overview of the project from both a conceptual and technical perspective. The README file includes a description of the project rationale, aims, and findings and instructions on how to reproduce the research. The DESCRIPTION file includes technical information about the computing environment and packages used to conduct the analysis.
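
For readers who would like to generate this structure from R, here is a minimal sketch using only base R functions. It creates the scaffold inside a temporary directory so that nothing on your system is overwritten; the location and the name "project" are placeholders, and the three root files are created empty, leaving their contents for you to fill in.

```r
# Create the minimal project scaffold in a temporary directory.
# The location and the name "project" are placeholders for illustration.
project_dir <- file.path(tempdir(), "project")

# Top-level sections: inputs, process steps (code), and outputs
sections <- c("input", "code", "output")
for (section in sections) {
  dir.create(
    file.path(project_dir, section),
    recursive = TRUE,      # also creates project_dir itself if needed
    showWarnings = FALSE   # stay quiet if the directory already exists
  )
}

# Root-level files that describe, document, and execute the project
# (created empty here)
root_files <- c("DESCRIPTION", "Makefile", "README")
created <- file.create(file.path(project_dir, root_files))

# Inspect what was created
list.files(project_dir)
```

To scaffold a real project, replace the temporary directory with the path where the project should live.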

The project structure in Snippet 4.1 meets the minimal structural requirements for reproducible research and is a good starting point for a project scaffold. However, aspects of this structure can be adjusted in minimal or more sophisticated ways to meet the needs of a particular project while still conforming to the principles outlined in Table 4.3, as we will see when we return to this topic in Chapter 11.

Activities

The following activities will build on your experience with R and cloning a GitHub repository, and recent experience with understanding the computing environment. The goal will be to bring you up to speed such that you can begin to work on your own research projects and understand how to use the tools and resources available to you to manage your project.

Recipe

What: Understanding the computing environment
How: Read Recipe 4, complete comprehension check, and prepare for Lab 4.
Why: To introduce components of the computing environment and how to manage a reproducible research project structure.

Lab

What: Scaffolding reproducible research
How: Fork, clone, and complete the steps in Lab 4.
Why: To establish a repository and project structure for reproducible research and apply new Git and GitHub skills to fork, clone, commit, and push changes.

Summary

The aim of this chapter is to provide the key conceptual and practical points to guide the development of a viable research project. Good research is purposive, inquisitive, informed, methodical, and communicable. It is not, however, always a linear process. Exploring your area(s) of interest and connecting with existing work will help couch and refine your research. But practical considerations, such as the existence of viable data, technical skills, and/or time constraints, sometimes pose challenges and require a researcher to rethink and/or redirect the research, sometimes in small ways and other times in more significant ones. The process of formulating a research question and developing a viable research plan is key to supporting viable, successful, and insightful research. To ensure that the effort to derive insight from data is of most value to the researcher and the research community, the research should strive to be methodical and communicable, adopting best practices for reproducible research.