References

Ackoff, R. L. (1989). From data to wisdom. Journal of Applied Systems Analysis, 16(1), 3–9.
Ädel, A. (2020). Corpus compilation. In M. Paquot & S. Th. Gries (Eds.), A Practical Handbook of Corpus Linguistics (pp. 3–24). Switzerland: Springer.
Albert, S., de Ruiter, L. E., & de Ruiter, J. P. (2015). CABNC: The Jeffersonian transcription of the spoken British National Corpus. TalkBank. Retrieved from https://saulalbert.github.io/CABNC/
Baayen, R. H. (2004). Statistics in psycholinguistics: A critique of some current gold standards. Mental Lexicon Working Papers, 1(1), 1–47.
Baayen, R. H. (2008). Analyzing linguistic data: A practical introduction to statistics using R. Cambridge University Press.
Baayen, R. H. (2010). A real experiment is a factorial experiment? The Mental Lexicon, 5(1), 149–157. doi:10.1075/ml.5.1.06baa
Baayen, R. H. (2011). Corpus linguistics and naive discriminative learning. Revista Brasileira de Linguística Aplicada, 11(2), 295–328.
Baayen, R. H., Feldman, L., & Schreuder, R. (2006). Morphological influences on the recognition of monosyllabic monomorphemic words. Journal of Memory and Language, 55, 290–313. doi:10.1016/j.jml.2006.03.008
Baayen, R. H., & Shafaei-Bajestan, E. (2019). languageR: Analyzing linguistic data: A practical introduction to statistics. Retrieved from https://CRAN.R-project.org/package=languageR
Baker, M. (2016). 1,500 scientists lift the lid on reproducibility. Nature, 533(7604), 452–454. doi:10.1038/533452a
Bao, W., Lianju, N., & Yue, K. (2019). Integration of unsupervised and supervised machine learning algorithms for credit risk assessment. Expert Systems with Applications, 128, 301–315. doi:10.1016/j.eswa.2019.02.033
Bengtsson, H. (2024). future: Unified parallel and distributed processing in R for everyone. Retrieved from https://future.futureverse.org
Benoit, K., & Obeng, A. (2024). readtext: Import and handling for plain and formatted text files. Retrieved from https://CRAN.R-project.org/package=readtext
Ben-Shachar, M. S., Makowski, D., Lüdecke, D., Patil, I., Wiernik, B. M., Thériault, R., & Waggoner, P. (2024). effectsize: Indices of effect size. Retrieved from https://easystats.github.io/effectsize/
Blischak, J. D., Carbonetto, P., & Stephens, M. (2019). Creating and sharing reproducible research code the workflowr way. F1000Research, 8(1749). doi:10.12688/f1000research.20843.1
Braginsky, M. (2024). wordbankr: Accessing the wordbank database. Retrieved from https://CRAN.R-project.org/package=wordbankr
Bray, A., Ismay, C., Chasnovski, E., Couch, S., Baumer, B., & Cetinkaya-Rundel, M. (2024). infer: Tidy statistical inference. Retrieved from https://github.com/tidymodels/infer
Bresnan, J. (2007). A few lessons from typology. Linguistic Typology, 11(1), 297–306.
Bresnan, J., Cueni, A., Nikitina, T., & Baayen, R. H. (2007). Predicting the dative alternation. In G. Bouma, I. Kraemer, & J.-W. C. Zwart (Eds.), Cognitive Foundations of Interpretation (pp. 1–33). Amsterdam: KNAW.
Brown, K. (2005). Encyclopedia of language and linguistics (Vol. 1). Elsevier.
Bryan, J., Hester, J., Robinson, D., Wickham, H., & Dervieux, C. (2024). reprex: Prepare reproducible example code via the clipboard. Retrieved from https://reprex.tidyverse.org
Buckheit, J. B., & Donoho, D. L. (1995). Wavelab and reproducible research. In Wavelets and statistics (pp. 55–81). Springer.
Bychkovska, T., & Lee, J. J. (2017). At the same time: Lexical bundles in L1 and L2 university student argumentative writing. Journal of English for Academic Purposes, 30, 38–52. doi:10.1016/j.jeap.2017.10.008
Campbell, L. (2001). The history of linguistics. In M. Aronoff & J. Rees-Miller (Eds.), The Handbook of Linguistics (pp. 81–104). Blackwell Publishers.
Carmi, E., Yates, S. J., Lockley, E., & Pawluczuk, A. (2020). Data citizenship: Rethinking data literacy in the age of disinformation, misinformation, and malinformation. Internet Policy Review, 9(2). Retrieved from https://policyreview.info/articles/analysis/data-citizenship-rethinking-data-literacy-age-disinformation-misinformation-and
Chambers, J. M. (2020). S, R, and data science. Proceedings of the ACM on Programming Languages, 4(HOPL), 1–17. doi:10.1145/3386334
Chan, S. (2014). Routledge encyclopedia of translation technology. Routledge.
Conway, D. (2010, September). The data science Venn diagram. drewconway.com. Retrieved from http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram
Conway, L. G., Gornick, L. J., Burfeind, C., Mandella, P., Kuenzli, A., Houck, S. C., & Fullerton, D. T. (2012). Does complex or simple rhetoric win elections? An integrative complexity analysis of U.S. Presidential campaigns. Political Psychology, 33(5), 599–618. doi:10.1111/j.1467-9221.2012.00910.x
Cross, N. (2006). Design as a discipline. Designerly Ways of Knowing, 95–103.
Csárdi, G., & Hester, J. (2024). pak: Another approach to package installation. Retrieved from https://pak.r-lib.org/
Csárdi, G., Nepusz, T., Traag, V., Horvát, S., Zanini, F., Noom, D., & Müller, K. (2024). igraph: Network analysis and visualization. Retrieved from https://r.igraph.org/
Data never sleeps 7.0. (2019). Data Never Sleeps 7.0. Infographic. Retrieved from https://www.domo.com/learn/infographic/data-never-sleeps-7
de Marneffe, M.-C., Manning, C. D., Nivre, J., & Zeman, D. (2021). Universal dependencies. Computational Linguistics, 47(2), 255–308. doi:10.1162/coli_a_00402
Deshors, S. C., & Gries, S. Th. (2016). Profiling verb complementation constructions across new Englishes. International Journal of Corpus Linguistics, 21(2), 192–218.
Desjardins, J. (2019, April). How much data is generated each day? Visual Capitalist. Retrieved from https://www.visualcapitalist.com/how-much-data-is-generated-each-day/
Donoho, D. (2017). 50 years of data science. Journal of Computational and Graphical Statistics, 26(4), 745–766. doi:10.1080/10618600.2017.1384734
Du Bois, J. W., Chafe, W. L., Meyer, C., Thompson, S. A., Englebretson, R., & Martey, N. (2005). Santa Barbara Corpus of Spoken American English, parts 1-4. Philadelphia: Linguistic Data Consortium. Retrieved from https://www.linguistics.ucsb.edu/research/santa-barbara-corpus#Acknowledgements
Dubnjakovic, A., & Tomlin, P. (2010). A practical guide to electronic resources in the humanities. Elsevier.
Duran, P. (2004). Developmental trends in lexical diversity. Applied Linguistics, 25(2), 220–242. doi:10.1093/applin/25.2.220
Eisenstein, J., O’Connor, B., Smith, N. A., & Xing, E. P. (2012). Mapping the geographical diffusion of new words. Computation and Language, 1–13. doi:10.1371/journal.pone.0113114
Firth, J. R. (1957). Papers in linguistics. Oxford University Press.
Francom, J. (2022). Corpus studies of syntax. In G. Goodall (Ed.), The Cambridge Handbook of Experimental Syntax (pp. 687–713). Cambridge University Press.
Gandrud, C. (2015). Reproducible research with R and RStudio (2nd ed.). CRC Press.
Garg, N., Schiebinger, L., Jurafsky, D., & Zou, J. (2018). Word embeddings quantify 100 years of gender and ethnic stereotypes. Proceedings of the National Academy of Sciences, 115(16), E3635–E3644. doi:10.1073/pnas.1720347115
Gentleman, R., & Temple Lang, D. (2007). Statistical analyses and reproducible research. Journal of Computational and Graphical Statistics, 16(1), 1–23.
Gilquin, G., & Gries, S. Th. (2009). Corpora and experimental methods: A state-of-the-art review. Corpus Linguistics and Linguistic Theory, 5(1), 1–26. doi:10.1515/CLLT.2009.001
GitHub. (2024). GitHub. Let’s build from here. Code Repository. Retrieved from https://github.com
Gomez-Uribe, C. A., & Hunt, N. (2015). The Netflix recommender system: Algorithms, business value, and innovation. ACM Transactions on Management Information Systems (TMIS), 6(4), 1–19.
Gries, S. Th. (2013). Statistics for linguistics with R: A practical introduction (2nd rev. ed.).
Gries, S. Th. (2016). Quantitative corpus linguistics with R: A practical introduction (2nd ed.). New York: Routledge. doi:10.4324/9781315746210
Gries, S. Th. (2021). Statistics for linguistics with R. De Gruyter Mouton.
Gries, S. Th. (2023). New technologies and advances in statistical analysis in recent decades. In M. Díaz-Campos & S. Balasch (Eds.), The Handbook of Usage-Based Linguistics (1st ed.). John Wiley & Sons Inc.
Gries, S. Th., & Deshors, S. C. (2014). Using regressions to explore deviations between corpus data and a standard/target: Two suggestions. Corpora, 9(1), 109–136. doi:10.3366/cor.2014.0053
Gries, S. Th., & Paquot, M. (2020). Writing up a corpus-linguistic paper. In M. Paquot & S. Th. Gries (Eds.), A Practical Handbook of Corpus Linguistics (pp. 647–659). Springer International Publishing. doi:10.1007/978-3-030-46216-1_26
Grieve, J., Nini, A., & Guo, D. (2018). Mapping lexical innovation on American social media. Journal of English Linguistics, 46(4), 293–319.
Harris, Z. S. (1954). Distributional structure. Word, 10(2-3), 146–162. doi:10.1080/00437956.1954.11659520
Hay, J. (2002). From speech perception to morphology: Affix ordering revisited. Language, 78(3), 527–555.
Head, M. L., Holman, L., Lanfear, R., Kahn, A. T., & Jennions, M. D. (2015). The extent and consequences of p-hacking in science. PLOS Biology, 13(3), e1002106. doi:10.1371/journal.pbio.1002106
Hester, J., Wickham, H., & Csárdi, G. (2024). fs: Cross-platform file system operations based on libuv. Retrieved from https://fs.r-lib.org
Hicks, S. C., & Peng, R. D. (2019, July). Elements and principles for characterizing variation between data analyses. arXiv. doi:10.48550/arXiv.1903.07639
Hvitfeldt, E. (2023). textrecipes: Extra recipes for text processing. Retrieved from https://github.com/tidymodels/textrecipes
Ide, N., Baker, C., Fellbaum, C., Fillmore, C., & Passonneau, R. (2008). MASC: The Manually Annotated Sub-Corpus of American English. In 6th International Conference on Language Resources and Evaluation, LREC 2008 (pp. 2455–2460). European Language Resources Association (ELRA).
Ide, N., & Macleod, C. (2001). The American National Corpus: A standardized resource for American English. In Proceedings of Corpus Linguistics. Lancaster UK.
Ignatow, G., & Mihalcea, R. (2017). An introduction to text mining: Research design, data collection, and analysis. Sage Publications.
Jaeger, T. F., & Snider, N. (2007). Implicit learning and syntactic persistence: Surprisal and cumulativity. University of Rochester Working Papers in the Language Sciences, 3(1).
Johnson, K. (2008). Quantitative methods in linguistics. Blackwell Pub.
Kato, A., Ichinose, S., & Kudo, T. (2024). gibasa: An alternative Rcpp wrapper of MeCab. Retrieved from https://CRAN.R-project.org/package=gibasa
Kerr, N. L. (1998). HARKing: Hypothesizing after the results are known. Personality and Social Psychology Review, 2(3), 196–217.
Kloumann, I., Danforth, C., Harris, K., & Bliss, C. (2012). Positivity of the English language. PLOS ONE. doi:10.1371/journal.pone.0029484
Koehn, P. (2005). Europarl: A parallel corpus for statistical machine translation. MT Summit X, 12–16.
Kostić, A., Marković, T., & Baucal, A. (2003). Inflectional morphology and word meaning: Orthogonal or co-implicative cognitive domains? In R. H. Baayen & R. Schreuder (Eds.), Morphological Structure in Language Processing (pp. 1–44). De Gruyter Mouton. doi:10.1515/9783110910186.1
Kowalski, J., & Cavanaugh, R. (2024). TBDBr: Easy access to TalkBankDB via R API. Retrieved from https://github.com/TalkBank/TalkBankDB-R
Krathwohl, D. R. (2002). A revision of Bloom’s Taxonomy: An overview. Theory Into Practice, 41(4), 212–218.
Kross, S., Carchedi, N., Bauer, B., & Grdina, G. (2020). swirl: Learn R, in R. Retrieved from https://CRAN.R-project.org/package=swirl
Kucera, H., & Francis, W. N. (1967). Computational analysis of present day American English. Providence: Brown University Press.
Landau, W. M. (2021). The targets R package: A dynamic make-like function-oriented pipeline toolkit for reproducibility and high-performance computing. Journal of Open Source Software, 6(57), 2959. Retrieved from https://doi.org/10.21105/joss.02959
Larsson, T., & Biber, D. (2024). On the perils of linguistically opaque measures and methods: Toward increased transparency and linguistic interpretability. In P. Crosthwaite (Ed.), Corpora for language learning: Bridging the research-practice divide (pp. 131–141). Taylor & Francis.
Leech, G. (1992). 100 million words of English: The British National Corpus (BNC), (1991), 1–13.
Lewis, M. (2004). Moneyball: The art of winning an unfair game. WW Norton & Company.
Liu, K., & Afzaal, M. (2021). Syntactic complexity in translated and non-translated texts: A corpus-based study of simplification. PLOS ONE, 16(6), e0253454. doi:10.1371/journal.pone.0253454
Lozano, C. (2009). CEDEL2: Corpus escrito del español L2. Applied Linguistics Now: Understanding Language and Mind/La Lingüística Aplicada Hoy: Comprendiendo el Lenguaje y la Mente. Almería: Universidad de Almería, 197–212.
MacWhinney, B. (2024). TalkBank. The TalkBank system. Repository. Retrieved from https://talkbank.org/
Manning, C. (2003). Probabilistic syntax. In R. Bod, J. Hay, & S. Jannedy (Eds.), Probabilistic Linguistics (pp. 289–341). Cambridge, MA: MIT Press.
Marwick, B., Boettiger, C., & Mullen, L. (2018). Packaging data analytical work reproducibly using R (and friends). The American Statistician, 72(1), 80–88.
Microsoft. (2024). Visual Studio Code. Code Editing. Redefined. Software. Retrieved from https://code.visualstudio.com/
Mikolov, T., Sutskever, I., Chen, K., Corrado, G., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems (pp. 3111–3119).
Moroz, G. (2017). lingtypology: Easy mapping for linguistic typology. Retrieved from https://CRAN.R-project.org/package=lingtypology
Morris, T. P., White, I. R., & Crowther, M. J. (2019). Using simulation studies to evaluate statistical methods. Statistics in Medicine, 38(11), 2074–2102. doi:10.1002/sim.8086
Mosteller, F., & Wallace, D. L. (1963). Inference in an authorship problem. Journal of the American Statistical Association, 58(302), 275–309. Retrieved from https://www.jstor.org/stable/2283270
Mullen, L. (2022). tokenizers: Fast, consistent tokenization of natural language text. Retrieved from https://docs.ropensci.org/tokenizers/
Muñoz, C. (Ed.). (2006). Age and the rate of foreign language learning (1st ed., Vol. 19). Clevedon: Multilingual Matters.
Nisioi, S., Rabinovich, E., Dinu, L. P., & Wintner, S. (2016). A corpus of native, non-native and translated texts. In Proceedings of the tenth international conference on language resources and evaluation (LREC 2016). Portorož, Slovenia: European Language Resources Association (ELRA).
Nivre, J., De Marneffe, M.-C., Ginter, F., Hajič, J., Manning, C. D., Pyysalo, S., … Zeman, D. (2020). Universal dependencies v2: An evergrowing multilingual treebank collection. arXiv preprint arXiv:2004.10643. Retrieved from https://arxiv.org/abs/2004.10643
Olohan, M. (2008). Leave it out! Using a comparable corpus to investigate aspects of explicitation in translation. Cadernos de Tradução, 153–169.
Ooms, J. (2023). jsonlite: A simple and robust JSON parser and generator for R. Retrieved from https://jeroen.r-universe.dev/jsonlite
Paquot, M., & Gries, S. Th. (Eds.). (2020). A practical handbook of corpus linguistics. Switzerland: Springer.
Pedersen, T. L. (2024). ggraph: An implementation of grammar of graphics for graphs and networks. Retrieved from https://ggraph.data-imaginist.com
Petrenz, P., & Webber, B. (2011). Stable classification of text genres. Computational Linguistics, 37(2), 385–393. doi:10.1162/COLI_a_00052
Posit. (2024). RStudio. RStudio. Software. Retrieved from https://posit.co
R Community. (2024). The comprehensive R archive network. The Comprehensive R Archive Network. Repository. Retrieved from https://cran.r-project.org/
R Special Interest Group on Databases (R-SIG-DB), Wickham, H., & Müller, K. (2024). DBI: R database interface. Retrieved from https://dbi.r-dbi.org
Riehemann, S. Z. (2001). A constructional approach to idioms and word formation (PhD thesis). Stanford.
Rinker, T. (2019). lexicon: Lexicons for text analysis. Retrieved from https://github.com/trinker/lexicon
Robinson, D., & Silge, J. (2024). tidytext: Text mining using dplyr, ggplot2, and other tidy tools. Retrieved from https://juliasilge.github.io/tidytext/
ROpenSci. (2024). The R-Universe System. The R-Universe System. Repository. Retrieved from https://ropensci.org/r-universe/
Rossman, A. J., & Chance, B. L. (2014). Using simulation-based inference for learning introductory statistics. WIREs Computational Statistics, 6(4), 211–221. doi:10.1002/wics.1302
Rowley, J. (2007). The wisdom hierarchy: Representations of the DIKW hierarchy. Journal of Information Science, 33(2), 163–180. doi:10.1177/0165551506070706
Saxena, S., & Gyanchandani, M. (2020). Machine learning methods for computer-aided breast cancer diagnosis using histopathology: A narrative review. Journal of Medical Imaging and Radiation Sciences, 51(1), 182–193.
Sedgwick, P. (2015). Units of sampling, observation, and analysis. BMJ (online), 351, h5396. doi:10.1136/bmj.h5396
Serigos, J. (2020). Using automated methods to explore the social stratification of anglicisms in Spanish. Corpus Linguistics and Linguistic Theory, 0(0), 000010151520190052. doi:10.1515/cllt-2019-0052
Shriberg, E. E. (1994). Preliminaries to a theory of speech disfluencies (PhD thesis). University of California at Berkeley.
Silge, J. (2022). janeaustenr: Jane Austen’s complete novels. Retrieved from https://github.com/juliasilge/janeaustenr
Silveira, N., Dozat, T., de Marneffe, M.-C., Bowman, S., Connor, M., Bauer, J., & Manning, C. D. (2014). A gold standard dependency corpus for English. In Proceedings of the ninth international conference on language resources and evaluation (LREC-2014).
Sternberg, R. J., & Sternberg, K. (2010). The psychologist’s companion: A guide to writing scientific papers for students and researchers (5th ed.). Cambridge University Press. doi:10.1017/CBO9780511762024
Szmrecsanyi, B. (2004). On operationalizing syntactic complexity. In Le poids des mots. Proceedings of the 7th international conference on textual data statistical analysis. Louvain-la-Neuve (Vol. 2, pp. 1032–1039).
The R Foundation. (2024). The R project for statistical computing. R: The R Project for Statistical Computing. Software. Retrieved from https://www.r-project.org/
Tottie, G. (2011). Uh and um as sociolinguistic markers in British English. International Journal of Corpus Linguistics, 16(2), 173–197.
University of Colorado Boulder. (2008). Switchboard Dialog Act Corpus. Web download. Linguistic Data Consortium. Retrieved from https://catalog.ldc.upenn.edu/docs/LDC97S62/
Uryu, S. (2024). washoku: Extra ’recipes’ for Japanese text, date and address processing.
U.S. Copyright Office. (n.d.). Copyright Law of the United States (Title 17). Retrieved from https://www.copyright.gov/title17/
Ushey, K., & Wickham, H. (2024). renv: Project environments. Retrieved from https://rstudio.github.io/renv/
Voigt, R., Camp, N. P., Prabhakaran, V., Hamilton, W. L., Hetey, R. C., Griffiths, C. M., … Eberhardt, J. L. (2017). Language from police body camera footage shows racial disparities in officer respect. Proceedings of the National Academy of Sciences, 114(25), 6521–6526.
Waring, E., Quinn, M., McNamara, A., Arino de la Rubia, E., Zhu, H., & Ellis, S. (2022). skimr: Compact and flexible summaries of data. Retrieved from https://docs.ropensci.org/skimr/
Welbers, K., & van Atteveldt, W. (2022). rsyntax: Extract semantic relations from text by querying and reshaping syntax.
Wenfeng, Q., & Yanyi, W. (2019). jiebaR: Chinese text segmentation. Retrieved from https://CRAN.R-project.org/package=jiebaR
White, J. M. (2023). ProjectTemplate: Automates the creation of new statistical analysis projects. Retrieved from https://CRAN.R-project.org/package=ProjectTemplate
Wickham, H. (2014a). Advanced R. CRC Press.
Wickham, H. (2014b). Tidy data. Journal of Statistical Software, 59(10). doi:10.18637/jss.v059.i10
Wickham, H. (2023a). forcats: Tools for working with categorical variables (factors). Retrieved from https://forcats.tidyverse.org/
Wickham, H. (2023b). stringr: Simple, consistent wrappers for common string operations. Retrieved from https://stringr.tidyverse.org
Wickham, H. (2023c). tidyverse: Easily install and load the Tidyverse. Retrieved from https://tidyverse.tidyverse.org
Wickham, H. (2024). rvest: Easily harvest (scrape) web pages. Retrieved from https://rvest.tidyverse.org/
Wickham, H., & Bryan, J. (2023). R packages: Organize, test, document, and share your code (2nd ed.). Beijing: O’Reilly.
Wickham, H., Chang, W., Henry, L., Pedersen, T. L., Takahashi, K., Wilke, C., … van den Brand, T. (2024). ggplot2: Create elegant data visualisations using the grammar of graphics. Retrieved from https://ggplot2.tidyverse.org
Wickham, H., François, R., Henry, L., Müller, K., & Vaughan, D. (2023). dplyr: A grammar of data manipulation. Retrieved from https://dplyr.tidyverse.org
Wickham, H., Girlich, M., & Ruiz, E. (2024). dbplyr: A dplyr back end for databases. Retrieved from https://dbplyr.tidyverse.org/
Wickham, H., & Henry, L. (2023). purrr: Functional programming tools. Retrieved from https://purrr.tidyverse.org/
Wickham, H., Hester, J., & Bryan, J. (2024). readr: Read rectangular text data. Retrieved from https://readr.tidyverse.org
Wickham, H., Miller, E., & Smith, D. (2023). haven: Import and export SPSS, Stata and SAS files. Retrieved from https://haven.tidyverse.org
Wickham, H., Vaughan, D., & Girlich, M. (2024). tidyr: Tidy messy data. Retrieved from https://tidyr.tidyverse.org
Wijffels, J. (2023). udpipe: Tokenization, parts of speech tagging, lemmatization and dependency parsing with the UDPipe ’NLP’ toolkit. Retrieved from https://bnosac.github.io/udpipe/en/index.html
Wijffels, J., & Watanabe, K. (2023). word2vec: Distributed representations of words. Retrieved from https://github.com/bnosac/word2vec
Wilson, G., Bryan, J., Cranston, K., Kitzes, J., Nederbragt, L., & Teal, T. K. (2017). Good enough practices in scientific computing. PLoS Computational Biology, 13(6), 1–20. doi:10.1371/journal.pcbi.1005510
Wulff, S., Stefanowitsch, A., & Gries, S. Th. (2007). Brutal Brits and persuasive Americans. Aspects of Meaning.
Xie, Y. (2024). tinytex: Helper functions to install and maintain TeX Live, and compile LaTeX documents. Retrieved from https://github.com/rstudio/tinytex
Zipf, G. K. (1949). Human behavior and the principle of least effort. Oxford, England: Addison-Wesley Press.