Naming conventions for variables, functions, etc.

The golden rule of code layout is that code should be written to be readable. And that means readable by others, including you in the future.

Three principles of naming follow:

1. Names should mean something.

2. Names should be as short as possible.

3. Use your judgement to balance (1) and (2).

The third one’s where all the fun arises. Do we use “i” or “n” for integer loop variables by convention? Yes, we do. Do we choose “inv_logit” or “inverse_logit”? Stan chose “inv_logit”. Do we choose “complex” or “complex_number”? C++ chose “complex”, as well as choosing “imag” over “imaginary” for the method to pull the imaginary component out.
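
For readers who don’t write C++, here’s a tiny sketch of what those standard-library choices look like in practice (std::complex and its imag() accessor are the real names; the snippet itself is just an illustration):

    #include <complex>
    #include <iostream>

    int main() {
      std::complex<double> z(3.0, 4.0);  // the type is "complex", not "complex_number"
      std::cout << z.imag() << "\n";     // the accessor is "imag", not "imaginary"; prints 4
    }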

Do we use names like “run_helper_function”, which is both long and provides zero clue as to what it does? We don’t if we want to do unto others as we’d have them do unto us.

P.S. If the producers of Silicon Valley had asked me, Winnie would’ve dumped Richard after a fight about Hungarian notation, not tabs vs. spaces.

21 thoughts on “Naming conventions for variables, functions, etc.”

    • Floating point absolute value, for those who aren’t C++ or Stan coders. In retrospect, I see I made a mistake in just following C++ function naming conventions in Stan. I should’ve just gone with:

      real abs(real)
      int abs(int)
      

      I still plan to deprecate `fabs` and go back to that. Part of the reason we didn’t do it originally was the conflict between C99 and C++03 conventions; that’s been better sorted out in C++11, and we’ve gotten better at traits metaprogramming.

      Instead, what we have in Stan now is

      real fabs(real)
      real fabs(int)
      int abs(int)
      

      But the middle one’s not necessary anywhere.
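
      To make that concrete, here’s a minimal C++ sketch (hypothetical overloads, not Stan’s actual source) of why two signatures suffice: with a double overload and an int overload, there’s nothing left for a separate fabs(int) to do.

      #include <iostream>

      // Hypothetical stand-ins for real abs(real) and int abs(int).
      double my_abs(double x) { return x < 0 ? -x : x; }
      int    my_abs(int n)    { return n < 0 ? -n : n; }

      int main() {
        std::cout << my_abs(-3.5) << "\n";      // double overload: prints 3.5
        std::cout << my_abs(-3)   << "\n";      // int overload: prints 3
        std::cout << my_abs(-3) * 0.5 << "\n";  // int result promotes to double when needed: 1.5
      }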

      I hadn’t quite anticipated how hard users trained in R would find the distinction between integers and floating-point values, which is fundamentally baked into our CPUs.

        • The natural numbers (0, 1, …) and integers (…, -1, 0, 1, …) are very natural mathematically. So much so that there’s a commonly used mathematical notation for them (N and Z). Their algebra’s different from that of real numbers. For example, integers aren’t closed under addition, making the expression “1 / 2” a thorny issue. In C++, you get integer rounding, and the result is 0; in R, if you evaluate “1 / 2” you get 0.5.

          But the real problem here isn’t what the ideal mathematician would do, but rather what a programmer in 2020 has to do on their computer to get code to work efficiently and transparently (and by transparently, I mean according to the IEEE 754 spec, which is where the float64 behavior is defined).
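
          As a concrete illustration (plain C++ language rules, nothing Stan-specific), the same-looking expression does truncating integer division or IEEE 754 double division depending only on the operand types:

          #include <iostream>

          int main() {
            std::cout << 1 / 2   << "\n";  // both operands int: truncating division, prints 0
            std::cout << 1.0 / 2 << "\n";  // one operand double: IEEE 754 division, prints 0.5
          }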

        • > integers aren’t closed under addition, making the expression “1 / 2” a thorny issue

          pretty sure you meant aren’t closed under *division*, but yes.

          I guess what I meant was that a typical “modeler” is thinking about real numbers, which the integers are a subset of. So if you already admit the entirety of the real line into your concept of “number”, then there’s no distinction about “the integers are special”. But the CPU makes this distinction. The distinction is artificial as far as a person working with real numbers is concerned. It doesn’t “help” them in any way mathematically; on the other hand, computationally it’s an entirely different set of instructions at a different speed.

        • If someone is programming, they have to learn at least a little bit about these things; that’s just the way it is. I used to prefer interpreted languages, but I despise R so much I have come around to preferring strongly typed languages.

          Type casting can have catastrophic consequences

          https://around.com/ariane.html
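
          (The Ariane 5 report describes an out-of-range conversion from a 64-bit floating-point value to a 16-bit signed integer. Here’s a rough C++ analogue, with a made-up variable and value, of how that kind of narrowing goes wrong; in the Ada flight code the overflow raised an unhandled exception rather than wrapping.)

          #include <cstdint>
          #include <iostream>

          int main() {
            // Hypothetical 64-bit value being narrowed to 16 bits.
            std::int64_t horizontal_bias = 40000;
            std::int16_t converted = static_cast<std::int16_t>(horizontal_bias);
            // 40000 exceeds INT16_MAX (32767), so the narrowing wraps; on a
            // typical two's-complement machine this prints -25536.
            std::cout << converted << "\n";
          }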

        • Just be glad for IEEE 754.
          1) The SAS folks had to work very hard to get the exact same results on IBM mainframes, DEC VAX machines, and IEEE 754 microprocessors.

          2) When we were doing SPEC benchmarks ~1989, we had to use fuzzy comparisons & algorithms whose iteration counts didn’t vary drastically or depend on the least significant bit. One benchmark dropped out because it gave noticeably different results for VAX, Motorola 68K machines, and RISC micros (754).

          You may recall regarding PL/I, “The product exhibits some lovely features but, unfortunately, the expression

          25 + 1/3

          yields 5.33333333333333”

          https://web.archive.org/web/20200201104326/https://plg.uwaterloo.ca/~holt/papers/fatal_disease.html

        • Obligatory mention of Leopold Kronecker’s quote (or at least attributed quote): “God made the integers; all else is the work of man.”

  1. This is what makes programming in R such a pain. So many different function names for doing the same thing. If I want to change the values of a column I have to use “transmute”. If I want to change the values of a list I have to use “map” or one of the other “map_” variants. If I want to change the values of levels I have to use fct_recode. If I want to change column names I have to use colnames(df) <- c(‘a’, ‘b’, ‘c’). I’d rather have one poorly named function that works everywhere than 5 well-named functions that work only on specific types.

    • Note that this isn’t the R language or even the standard library, but rather the tidyverse. Base R does most of that via indexing for values and names() for names.

    • And the long form of this quote: “There are only two hard things in computer science: cache invalidation and naming things and off-by-one errors.”

  2. Here are the variable names I used in a set of R functions I came up with this past weekend. I think I probably get a failing grade from Bob…

    covV
    pd
    typ
    dcv
    iir
    rwQuant
    fsQuant
    Qual
    covS
    InclV
    amb
    Xamb
    x
    y
    InclH
    PRI
    InclS
    VeriCov
    vp
    expr
    lastrow
    res1V
    res1H
    res1
    res2V
    res2H
    res2
    resH
    resV
    stats
    veriplot
    whiteout

    Although in fairness, the two most egregious ones (x and y) were chosen because they are the de facto standard variables in the literature for the calculations I was implementing.

    • If you use autocomplete and name “backwards” (specific to general):

      subsubgroupname_subgroupname_groupname_supergroupname

      autocomplete is your friend and length doesn’t matter that much.

    • In R, I’ve tended to extend the convention that dataframes are called something.df. So a gam model is something.gam and an nlme model is something.nlme. Since R is all about objects, it’s nice to have the type of object in the name.

      In surveys/questionnaires/medical charts, I got taught the convention of having the question number as the start of the name. So question 4 in section B, which asked about diabetes, would become B4_diab. That was back when SAS only allowed 8 characters in a variable name, but I think it’s still useful even now that variable names can be longer.

      • We have one survey dataset in SAS with somewhere upwards of 1,700 variables (it’s a merge of multiple surveys, each repeated at several points in time in a “wide” format). If we still had the 8-character variable name restriction in SAS, I’m not sure what the heck we’d be using for names.

      • @mpledger: That’s Hungarian notation for R.

        @jim: I should have clarified that I meant for writing programs, not interactive scripting in a REPL environment like R’s or Python’s.

        Research code done for exploration is a bit different from that done for a submitted paper, but neither demands the kind of naming effort required for a code base with a couple dozen developers.

        For reference, it might be easy to type this with autocomplete:

        very_long_variable_name_one[my_first_long_index, my_second_long_index] = very_long_variable_name_one[my_first_long_index, my_middle_long_index] * very_long_variable_name_two[my_middle_long_index, my_second_long_index];
        

        but it’s hard to see the basic structure compared to:

        a[i, j] = a[i, k] * b[k, j];
        
