All the little decisions we have to make when developing public-facing software

Jonah, Philip, Gustavo, and I are putting together an R package, caliBISG, implementing our calibrated Bayesian improved surname geocoding algorithm. The background to the awkward name is that BISG already existed, and in our recent paper we added a calibration step.

Setting up the R package required many steps (none of which were done by me). Below are some excerpts from the email thread.

Gustavo:

Testing the package–

I think the last name ‘Lee’ is a good illustrative example, so I ran it for Spokane County (2% Asian) and King County (where Seattle is; almost 20% Asian).

> print_comparison_tables(race_probabilities(c("Lee", "Lee"), c("wa", "wa"), c("spokane", "king")))
Surname: Lee
State: WA
County: Spokane
Year: 2020

Race     Pr_calibisg   Pr_bisg
------------------------------
API      0.11          0.27
White    0.76          0.62

Surname: Lee
State: WA
County: King
Year: 2020

Race     Pr_calibisg   Pr_bisg
------------------------------
API      0.52          0.75
White    0.35          0.18
------------------------------

Here you already see the huge difference in estimates that caliBISG offers relative to BISG: caliBISG’s estimates are much closer to the 45% Asian proportion of Lees in the 2010 U.S. census. BISG’s 75% number is huge even if King County has a large Asian population. The package was incredibly easy to use to do this.

Comments:

– Would be very useful to be able to aggregate counties or states. For example, if I want to compare probabilities or predict race in the greater Dallas area, I would want to aggregate across 11 counties all within Texas. If I wanted to do the same for the Philadelphia metro area, I would want to aggregate across counties in NJ, DE, and PA.

– Not much else. I used all of the functions in the package with ease and the documentation was clear.

– Slightly getting ahead of myself, but if we can pass the output to the functions in this package or write our own, it wouldn’t take that much to put a paper about racially polarized voting together.

– Can we add a function for seeing all of the names together? I know the dataframe would be enormous, but seeing just the top 100 white, Black, Hispanic, etc. names would be useful in some contexts.

Minor comments:

– I had to use a personal access token to download the package—I’m guessing because RStudio isn’t connected to my github and the repo is private.

– Slightly more straightforward to rename print_comparison_tables to just print (since you can mask the function and use the compare_bisg class)

– I went ahead and downloaded a third state, but probabilities couldn’t be returned for it (Florida)–I’m guessing that’s just not set up on the back end yet though.

All in all, this is looking great. Thanks for all the work behind this, Jonah and Philip!

Jonah:

Below are some responses to your comments and questions.

Would be very useful to be able to aggregate counties or states. For example, if I want to compare probabilities or predict race in the greater Dallas area, I would want to aggregate across 11 counties all within Texas. If I wanted to do the same for the Philadelphia metro area, I would want to aggregate across counties in NJ, DE, and PA.

Yeah I could imagine that being very useful. Can you say more about this? Imagine you’re talking to someone who thinks more like a software developer than a social science researcher (which is basically true of me, although working with Andrew I’ve been involved in various social science projects). How are you imagining the package would let the user specify something like this and how would you expect the aggregation to be done internally? Would we need to weight the counties according to population size? Or surname frequency? Or are you imagining something different? Would the user specify a list of states or counties to aggregate?

Can we add a function for seeing all of the names together? I know the dataframe would be enormous, but seeing just the top 100 white, Black, Hispanic, etc. names would be useful in some contexts.

Do you mean the top names in a county? State? Overall? I don’t think the data I have has enough information to provide this. The files I have don’t include a number for how many people in each state or county have a certain name. But maybe Philip does have data on this that I don’t have?

Slightly more straightforward to rename print_comparison_tables to just print (since you can mask the function and use the compare_bisg class)

Yeah I thought about this. I think we would need to set a default maximum number of tables to print instead of defaulting to printing one per row. Otherwise if someone used race_probabilities() with really long input vectors it could end up printing hundreds or thousands of tables. We could let the user specify something like options(calibisg_max_print = some_positive_integer) to change the default, which is similar to what base R does with its max.print option. How does that sound? We could also provide an argument to the print method to override the default. And what would you think is a good default number of tables/rows to print?

I went ahead and got rid of print_comparison_tables in favor of just defining a print method for the compare_bisg class. I set the default to print tables for a maximum of 5 rows but this can be changed either via print(max_print = …) or by setting options(calibisg.max_print = …), which will change the default for an entire R session. I’m not sure if 5 is the right default, just needed something to use for now. What do you suggest for a default?

I also added something similar for the number of digits to print. The default is 2 but can be changed either via print(digits = …) or options(calibisg.digits = …).
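To make the options-plus-argument pattern concrete, here is a minimal sketch of an S3 print method whose defaults come from session options but can be overridden per call. The class name `demo_tables`, the option names `demo.max_print` and `demo.digits`, and the column names are all made up for illustration; this is not the package's actual code.

```r
# Toy constructor: wraps a data frame in a hypothetical "demo_tables" class.
new_demo_tables <- function(df) {
  structure(list(data = df), class = "demo_tables")
}

# Print method: defaults are read from options() at call time, so
# options(demo.max_print = ...) changes them for the whole session,
# while print(x, max_print = ...) overrides them for a single call.
print.demo_tables <- function(x, ...,
                              max_print = getOption("demo.max_print", 5L),
                              digits = getOption("demo.digits", 2L)) {
  n <- nrow(x$data)
  n_show <- min(n, max_print)
  for (i in seq_len(n_show)) {
    row <- x$data[i, ]
    cat("Surname:", row$name, "\n")
    cat("Pr_a:", format(round(row$pr_a, digits), nsmall = digits), "\n\n")
  }
  if (n > n_show) {
    cat("...", n - n_show, "more table(s) not shown\n")
  }
  invisible(x)
}

# Made-up data: seven rows, so the default cap of 5 hides two tables.
obj <- new_demo_tables(data.frame(
  name = paste0("name", 1:7),
  pr_a = seq(0.1, 0.7, by = 0.1),
  stringsAsFactors = FALSE
))
print(obj, max_print = 2)
```

Reading the option inside the default argument, rather than once at load time, is what lets a later `options()` call take effect without reloading anything.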

The GitHub repo should now be updated with these changes.

I went ahead and downloaded a third state, but probabilities couldn’t be returned for it (Florida)–I’m guessing that’s just not set up on the back end yet though.

It wasn’t set up for this yet, but I worked on it yesterday and I think it should be working now if you want to reinstall and try again. It should also now work to request states that we don’t have caliBISG for at all and it will still give you traditional BISG estimates. That should work for all but six states where I’m missing some information I need to calculate BISG, but Philip is working on getting me that info. So for now regular BISG should work for 44 states and caliBISG should work for 7 states. And for regular BISG you don’t need to download any large files. The BISG calculations are done by the package on the fly using data files small enough to ship with the package when you install it.
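For readers who have not seen it, the traditional BISG calculation is just Bayes' rule: the probability of each race given surname and location is proportional to P(race | surname) times P(location | race). Here is a rough sketch with made-up numbers; the function name and inputs are hypothetical and have nothing to do with how the package stores or computes these quantities.

```r
# Bayes' rule for BISG:
#   P(race | surname, geo) is proportional to P(race | surname) * P(geo | race)
bisg_posterior <- function(p_race_given_surname, p_geo_given_race) {
  stopifnot(length(p_race_given_surname) == length(p_geo_given_race))
  unnormalized <- p_race_given_surname * p_geo_given_race
  unnormalized / sum(unnormalized)  # normalize so the probabilities sum to 1
}

# Made-up inputs for illustration only (not census or voter-file values).
p_race_given_surname <- c(api = 0.45, white = 0.40, other = 0.15)
p_geo_given_race     <- c(api = 0.004, white = 0.001, other = 0.002)

post <- bisg_posterior(p_race_given_surname, p_geo_given_race)
print(round(post, 2))
```

Both inputs are small lookup tables, which is consistent with the point that plain BISG needs no large downloads.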

Thanks for your advice and feedback!

Gustavo:

This is our idea: if more than one state is involved, do separate inference for each state, covering all the relevant counties in each of those states. Suppose there are 3 states. The package would need to run caliBISG for each state separately; then we would weight by county-level population size to get the metro area.
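A sketch of what that weighting step could look like, assuming the county-level probabilities for one surname have already been computed. The function name, the two counties, and all the numbers are invented; population weights are the choice described in the thread, and weighting by surname frequency (which Jonah asked about) would only change the `weights` vector.

```r
# probs: matrix with one row per county and one column per race category,
#        holding that county's estimated probabilities for a given surname.
# population: county populations, in the same order as the rows of probs.
aggregate_metro <- function(probs, population) {
  stopifnot(nrow(probs) == length(population), all(population > 0))
  weights <- population / sum(population)
  drop(weights %*% probs)  # population-weighted average of the rows
}

# Invented numbers for two hypothetical counties.
probs <- rbind(
  county_a = c(api = 0.50, white = 0.40, other = 0.10),
  county_b = c(api = 0.10, white = 0.80, other = 0.10)
)
population <- c(county_a = 3e6, county_b = 1e6)

metro <- aggregate_metro(probs, population)
print(metro)
```

Because each row sums to 1 and the weights sum to 1, the aggregated vector is still a valid probability distribution.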

Regarding the top names in a county, I actually can’t think of a practical use for this feature, so let’s drop this one.

Regarding printing of the tables, the way it looks now (with the updated package) looks good to me. Adding a max print makes sense, too, but as a default let’s say four? Past four tables you need to scroll on most displays and that gets confusing.

Jonah:

Changing the default to print 4 tables sounds good. I’ll do that now. Were you also able to check that using more states works? I think you said you previously tried Florida before I had enabled that, but that should now be working. It should also now work to request estimates for a state we don’t have caliBISG for and you should still get BISG (except for 6 states that Philip is still getting the necessary data for).

For example, this should work for BISG even though we don’t have caliBISG for Maryland yet:

> most_probable_race("Smith", "MD", "Allegany")

   name year state   county calibisg_race bisg_race in_census
1 smith 2020    MD allegany          <NA>  white_nh        NA

Warning message:
caliBISG is not available for 1 input(s). Returning NA estimates for those cases.

Gustavo:

I tested MD as you suggested, but also a bunch of other states. Unless I happened to guess the exact states that we don’t have the data for yet, there are still a bunch that the package doesn’t return BISG predictions for.

> most_probable_race("Smith", "MD", "Allegany")
   name year state   county calibisg_race bisg_race in_census
1 smith 2020    MD allegany          <NA>  white_nh        NA
Warning message:
caliBISG is not available for 1 input(s). Returning NA estimates for those cases.
> most_probable_race("Smith", "SD", "Allegany")
   name year state   county calibisg_race bisg_race in_census
1 smith 2020    SD allegany          <NA>      <NA>        NA
Warning messages:
1: caliBISG is not available for 1 input(s). Returning NA estimates for those cases.
2: Traditional BISG is not available for 1 input(s). Returning NA estimates for those cases.
> most_probable_race("Smith", "ND", "Allegany")
   name year state   county calibisg_race bisg_race in_census
1 smith 2020    ND allegany          <NA>      <NA>        NA
Warning messages:
1: caliBISG is not available for 1 input(s). Returning NA estimates for those cases.
2: Traditional BISG is not available for 1 input(s). Returning NA estimates for those cases.
> most_probable_race("Smith", "SC", "Allegany")
   name year state   county calibisg_race bisg_race in_census
1 smith 2020    SC allegany          <NA>      <NA>        NA
Warning messages:
1: caliBISG is not available for 1 input(s). Returning NA estimates for those cases.
2: Traditional BISG is not available for 1 input(s). Returning NA estimates for those cases.
> most_probable_race("Smith", "NC", "Allegany")
   name year state   county calibisg_race bisg_race in_census
1 smith 2020    NC allegany          <NA>      <NA>        NA
Warning messages:
1: caliBISG is not available for 1 input(s). Returning NA estimates for those cases.
2: Traditional BISG is not available for 1 input(s). Returning NA estimates for those cases.

Jonah:

Thanks for checking. I think the issue here is that there’s no Allegany County in the states you tried except Maryland. Currently, at the very least, the state and county have to exist in order to compute BISG using the data Philip sent me (Philip, is that correct?). The surname doesn’t necessarily have to exist (we have an “all other names” distribution).
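The "all other names" fallback can be pictured as a lookup with a default row. This is only a toy sketch: the table, its row key `all_other_names`, and the numbers are invented and say nothing about how the package's data files are laid out.

```r
# Toy surname table: one row per known surname plus a catch-all row.
# All values are made up for illustration.
surname_table <- rbind(
  smith           = c(api = 0.01, white = 0.70, other = 0.29),
  all_other_names = c(api = 0.07, white = 0.60, other = 0.33)
)

# Return the row for a surname, falling back to the catch-all row
# when the surname is not in the table.
lookup_surname <- function(surname, table = surname_table) {
  key <- tolower(surname)
  if (!key %in% rownames(table)) {
    key <- "all_other_names"
  }
  table[key, ]
}

print(lookup_surname("Smith"))
print(lookup_surname("Zzyzx"))  # not in the table, so the catch-all row is used
```

There is no comparable default for geography, which is why a county that does not exist in the requested state leads to NA.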

Gustavo:

Oh, yes, duh.

BISG estimate comes up here:

> most_probable_race("Smith", "SD", "Buffalo")
   name year state  county calibisg_race bisg_race in_census
1 smith 2020    SD buffalo          <NA>      aian        NA

Jonah:

Ok great, glad that works. One other update about hosting the large caliBISG files. We had talked about hosting them on Harvard Dataverse since GitHub has stricter file size limits. However, I’ve been researching this a bit more, and if we don’t track the files in the GitHub repository but rather only attach them to GitHub releases of the package, it seems like there are no file size limitations. This would be nice because it means we could have everything in the same place rather than host the code on GitHub and the files elsewhere. I’m going to test this out by creating an unadvertised release so I can see if I can upload large files and download them with the package.
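For context, files attached to a release of a public GitHub repository are served from a predictable URL of the form `https://github.com/OWNER/REPO/releases/download/TAG/FILE`. Here is a small sketch of fetching one from R. The owner, repo, tag, and file name are placeholders, the helper names are made up, and this is not how the package's own download function is written. Private repositories need authentication, which this sketch does not handle, and it is worth checking GitHub's documentation for any per-file size cap on release assets.

```r
# Build the download URL for a file attached to a GitHub release.
release_asset_url <- function(owner, repo, tag, file) {
  sprintf("https://github.com/%s/%s/releases/download/%s/%s",
          owner, repo, tag, file)
}

# Download the asset into dest_dir and return the local path.
# Defined but not called here, since it needs network access.
download_release_asset <- function(owner, repo, tag, file,
                                   dest_dir = tempdir()) {
  dest <- file.path(dest_dir, file)
  utils::download.file(release_asset_url(owner, repo, tag, file),
                       destfile = dest, mode = "wb")
  dest
}

# Placeholder values, for illustration only.
url <- release_asset_url("some-owner", "some-repo", "v0.0.0", "example.rds")
print(url)
```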

Me:

sounds good; thanks!

(As you can see, I’ve been a very valuable contributor to this process.)

Jonah:

A few quick updates and questions:

– I’ve got the GitHub downloads working. See below for instructions for trying it out.

– I think we’re basically ready for more people to try out the package. Do you think we should make the repository public for that or wait until we’ve gotten more feedback? If the latter, we can give specific people access to the private repo.

– We seem to not have full county names for Florida. Apparently the Florida voter file uses abbreviated county names, whereas other states have full names. Philip is working on getting the full names for Florida from the census files.

– I adapted the previous demo into a package vignette. See attached HTML version.

– To try out this newest version of the package use the code below. The GitHub personal access token (PAT) is required because it’s still a private repository.

install.packages(c("pak", "gitcreds"))

# will prompt you to enter your GitHub PAT
gitcreds::gitcreds_set()

# you may need to restart your R session before
# trying to install the package if it fails to
# detect your PAT
pak::pak("jgabry/caliBISG")

# download all 7 caliBISG files from GitHub
# first deleting any old versions of the files
library(caliBISG)
delete_all_data()
download_data()

# check that the files were downloaded
available_data()

# use the functions like before
most_probable_race("Smith", "WA", "King")
race_probabilities("Smith", "WA", "King")

# I added a function to list the valid county names
valid_counties("WA")

Following up on my previous email, we now have full county names for Florida (thanks Philip!).

I just tagged a new test release (v0.0.2) and uploaded the new versions of the files (they’re all the same except for Florida).

1. Are you all ok with making the repository public now even while we’re still gathering feedback? Or would you prefer to wait until we’re fully ready for a release. Everything is easier if the repo is public (installing, downloading the data) but I can manage giving out access selectively if you’d rather keep it private for now.

2. I’ve added all of you as coauthors of the R package, but let me know if you’d rather not be listed. It’s common to have two citations in a case like this, one for the R package itself and one for the paper it’s based on. So when the user does citation("caliBISG") in R I’ll have it give both citations. Does that sound OK?

Gustavo:

Some feedback from testers. One was particularly helpful. I’ll send them over as I hear back:

Everything worked smoothly. Neat package.

I installed everything with no hiccups or issues. I played around a bit and the package is great — very intuitive and results look reasonable for the cases I tried.

In case they’re useful, three quick thoughts on usability:

· FIPS support: I wonder if it’s worth supporting FIPS codes as alternatives to character for counties. That’ll help you avoid weird matching issues (e.g. looks like Saint Lawrence County in NY has to be formatted as "st.lawrence" or else you get NAs returned). FIPS codes might be a nice alternative for people who don’t want to worry about that cleaning.

· Internal auto-replication: One minor thing: if you’re trying to get estimates for a bunch of names from the same county, it felt a bit clunky to have to replicate the state and county vector. For example, I think the way to check four surnames from the same county is to do something like:

most_probable_race(c("Smith", "Simko", "Novoa", "Gelman"),
                   rep("WA", 4),
                   rep("King", 4))

You need to replicate the state / county, or else you get NAs returned. I totally see why it works like that, to ensure the names / locations are all the same thing. But, I wonder if it’s worth implementing an edge case for length > 1 names, but length == 1 county and state and automatically replicate them internally (and maybe print a message that you did). To me, this looks much cleaner:

most_probable_race(c("Smith", "Simko", "Novoa", "Gelman"),
                   "WA",
                   "King")
# Message: generating BISG predictions for four surnames in King County, WA.

And that would let you directly insert a column from some other data frame, e.g:

most_probable_race(df$surnames,
                   "WA",
                   "King")

The counterargument is you could already just do that with data frame columns for state and county, so it’s not really critical. Just a small suggestion.
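The auto-replication idea amounts to a small input-recycling step before any lookups happen. Here is one way it might look; the helper name `recycle_location` and the error wording are hypothetical, not the package's.

```r
# Recycle length-1 state/county inputs to match the number of names.
# Any other length mismatch is an error rather than silently giving NAs.
recycle_location <- function(name, state, county) {
  n <- length(name)
  recycle <- function(x, label) {
    if (length(x) == 1L) {
      return(rep(x, n))
    }
    if (length(x) != n) {
      stop(sprintf("`%s` must have length 1 or %d (the number of names).",
                   label, n), call. = FALSE)
    }
    x
  }
  data.frame(
    name = name,
    state = recycle(state, "state"),
    county = recycle(county, "county"),
    stringsAsFactors = FALSE
  )
}

inputs <- recycle_location(c("Smith", "Simko", "Novoa", "Gelman"), "WA", "King")
print(inputs)
```

Allowing only length 1 or the full length, instead of R's general recycling rule, avoids quietly pairing names with the wrong county when the vectors are misaligned.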

· Pivot option? I wonder if it’s worth having a quick binary argument in the two main prediction functions for a long data frame output. Some people might prefer it that way if they want to do some later grouping based on particular racial groups or estimates (e.g. all estimates > some fixed value). It would automate something like:

library(tidyr)

example_df <- most_probable_race(c("Smith", "Simko", "Novoa", "Gelman"), "WA", "King")
example_df |>
  pivot_longer(cols = starts_with("bisg_") | starts_with("calibisg_"),
               names_to = c("method", "race"),
               names_pattern = "^(bisg|calibisg)_(.+)$",
               values_to = "prob")

Good luck with the package! This is an awesome project. Any sense of when the other states will be added? I can’t wait to use it in my own work.

okay, a couple of notes. First, you guys need a tutorial. Just a basic tutorial showing how it works. It needs to be front and center in the github. Second:

> download_data()
* Downloading, reading, and saving file for: FL, 2020
Error: Failed to fetch release info: HTTP 404

available_data() returns 0?

character(0)

> download_data(c("VT", "WA"), 2020)
* Downloading, reading, and saving file for: VT, 2020
Error: Failed to fetch release info: HTTP 404

looking at the example returns an http 404 error?
I wonder if this is because the package points to the github which isn’t open yet?
I also downloaded and compiled from source

Jonah:

This is great feedback, thanks! A few comments on the various suggestions:

– I think using FIPS codes in addition to names is a great idea. It’s definitely confusing that in some places the county names have appropriate spaces between words or after periods but other times they don’t. I guess this is just how they came out of the voter files? So providing FIPS as an alternative would be great. Philip, I guess we just need a dictionary to map between FIPS and county name for each state.

– I will definitely update to allow state and county to be length 1 so the user can provide a bunch of names for the same state and county more easily. I had thought about this at one point and forgot. I just opened an issue in the repository so I won’t forget.

– I have mixed thoughts on whether we should provide the pivot functionality ourselves since it’s pretty easy for people to do on their own. I’m open to it, just probably not urgent. I’ll open an issue in the repository so we remember to decide on this.

– We already have a tutorial, it’s just not front and center in the GitHub repository yet, it’s a package vignette. I forgot that installing from GitHub doesn’t automatically install the package vignettes. I’ll add some tutorial examples to the readme on the GitHub home page and add a note about how to get the vignettes when installing from GitHub.

– I think the person who got a 404 error while downloading the data doesn’t have a GitHub PAT set up. They said they downloaded the repository and built the package from source themselves, which doesn’t require a PAT (installing via pak() or install_github() does require one but not building it yourself). But running download_data() does require one. I just went ahead (one minute ago) and made the repo public, so I think it should now work without the PAT.

Here’s a link to the issue tracker for the package if you want to follow along when I complete things: https://github.com/jgabry/caliBISG/issues

Jonah again:

A few follow ups to my last email:

1. I’ve now already added the functionality for providing a single state and county with multiple different names. So this now works if you freshly install from GitHub:

most_probable_race(c("Lopez", "Jackson"), "WA", "King")

2. I also updated the readme on the GitHub landing page for the package to show how to download and access the tutorial vignette. I also added a very simple example to the readme itself.

3. Philip you asked previously what we would do about citations when we also have a JSS paper. I’ve seen cases where people replace the R package citation with the JSS citation and other cases when they ask people to cite everything. When someone calls citation("caliBISG") we can put a note indicating how we prefer they cite our work. For example, we could say to always cite your original paper about the caliBISG method and to cite the JSS paper if they used the implementation in the R package. Or we could ask them to cite all three (JRSS, JSS, R package). We can figure out what we prefer when we actually have the JSS paper.

4. Regarding the county names, I definitely still think FIPS is a good idea, but while we’re waiting on that would you all prefer if I made sure names with a period in them like the one mentioned in the feedback always have spaces (e.g. convert st.lawrence to st. lawrence)? Or would you prefer that we leave the names the way they are, which I guess is directly from the voter file or census? It would be easy for me to write a short script that makes sure there’s always a space after a period before I include the files with the package.
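The "short script" for spacing after periods could be a one-line regular expression. Here is a sketch, assuming county names are held in a plain character vector; the function name is made up.

```r
# Insert a space after any period that is immediately followed by a
# non-space character; names that are already spaced are left alone.
add_space_after_period <- function(x) {
  gsub("\\.(?=[^[:space:]])", ". ", x, perl = TRUE)
}

counties <- c("st.lawrence", "st. lawrence", "albany")
print(add_space_after_period(counties))
```

The lookahead keeps the function idempotent, so running it on files that are already partly cleaned does no harm.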

And I have a question about how we want to handle the FIPS codes. Here are a few options for how to let the user provide FIPS codes:
1. Add an argument fips that can be specified instead of county (using 3 or 5 digit FIPS codes):
most_probable_race(name, state, fips = FIPS)
2. Add an argument fips that can be specified instead of both state and county (using 5 digit FIPS codes):
most_probable_race(name, fips = FIPS)
3. Keep the user interface how it currently is and provide a function that converts between FIPS and county:
most_probable_race(name, state, county = fips_to_county(FIPS))
I’m open to any of these options (or a different one if you have a suggestion), but I do have a slight preference for the third option because it keeps the function signature the cleanest. It avoids the situation where we have a fips argument as well as the county and state arguments but only a subset of those arguments can be specified at a given time. In that situation we either need to error if they specify them at the same time or throw a warning and document which argument will take precedence. It’s not a huge deal to do that obviously, but in general I find APIs like that annoying and it seems cleaner to just have what we currently have but provide a way for the user to easily convert between FIPS and county.

Do any of you have a strong preference for one of these options or a different one?

Philip:

I prefer the third option and agree with what you wrote. The fips_to_county(FIPS) function would also have to take a state as input, right? Something like fips_to_county(state, FIPS)? Or alternatively a list of states fips_to_county(states, FIPS)?

An additional option would be to include FIPS county code in the output.

Jonah:

Yeah we could do it with a `state` argument if we ask for 3-digit FIPS codes. There are also 5-digit codes that include state info. The data you gave me has both, although in separate variables (I can combine them to create the 5-digit codes, which seem to be pretty widely used).
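Combining the two variables is mostly a matter of zero-padding: a 5-digit county FIPS code is the 2-digit state code followed by the 3-digit county code. A minimal sketch, with a made-up function name:

```r
# Build 5-digit FIPS codes from separate state and county codes,
# restoring leading zeros that are lost when codes are read as numbers.
make_fips5 <- function(state_fips, county_fips) {
  sprintf("%02d%03d", as.integer(state_fips), as.integer(county_fips))
}

# State code 36 with county code 001 gives the Albany, NY code
# that comes up later in the thread.
print(make_fips5("36", "001"))
print(make_fips5(1, 1))  # numeric input still yields a zero-padded string
```

Keeping the result as a character string matters because numeric storage drops the leading zero for states with codes below 10.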

Gustavo:

Ditto on preferring option 3.

Yes would need to provide state as well.

For what it’s worth, there are functions from packages on CRAN that already do this. For example, censable::recode_fips_abb() converts state abbreviation to FIPS.

Jonah:

Glad you guys also prefer option 3. In terms of the function to convert between fips and county, do you have a preference between these two options?

1) fips_to_county(fips = "001", state = "NY")
2) fips_to_county(fips = "36001")

Both of these options specify Albany county in NY. I guess we could support both, but if one seems preferable it’s simpler to just go with that option.

Philip:

I’ll defer to Gustavo on this one.

Gustavo:

Let’s do fips_to_county(fips = "36001")
And just throw an error saying “fips should be a 5 digit code”, so it’s obvious what went wrong if people input something else?

Jonah:

So we’ll go with fips_to_county(fips = "36001") with an informative error message if they don’t provide a 5-digit code. I’ll go ahead and implement that.

If you reinstall from GitHub you should now be able to use the fips_to_county() function. So, for example, these two calls to most_probable_race() should return the same output:

most_probable_race(
  name = "Chan",
  state = "NY",
  county = "Albany"
)
most_probable_race(
  name = "Chan",
  state = "NY",
  county = fips_to_county("36001")
)

If you provide an invalid FIPS code there are several different errors you could get, depending on if you provide the wrong number of digits or if the code is the right length but doesn’t correspond to any real county. For example:

> fips_to_county("123")
Error: `fips` must be a character vector of 5-digit FIPS codes.

> fips_to_county("12345")
Error: The following FIPS codes could not be converted: 12345
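For anyone curious how a converter with those two failure modes might be put together, here is a toy version. It uses a two-entry lookup table and a made-up function name (`toy_fips_to_county`); the real function presumably draws on the package's own data, and its internals may look nothing like this.

```r
# Toy lookup table; the real mapping would cover every county.
toy_fips_table <- c("36001" = "albany", "53033" = "king")

toy_fips_to_county <- function(fips, table = toy_fips_table) {
  # First failure mode: input is not a character vector of 5-digit codes.
  if (!is.character(fips) || !all(grepl("^[0-9]{5}$", fips))) {
    stop("`fips` must be a character vector of 5-digit FIPS codes.",
         call. = FALSE)
  }
  county <- unname(table[fips])
  # Second failure mode: well-formed codes that match no county.
  bad <- fips[is.na(county)]
  if (length(bad) > 0) {
    stop("The following FIPS codes could not be converted: ",
         paste(bad, collapse = ", "), call. = FALSE)
  }
  county
}

print(toy_fips_to_county("36001"))
```

Checking the format before the lookup is what lets the two error messages stay distinct: a malformed code never reaches the table.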

Me:

Hi all. I have nothing to add to this discussion . . . but could I blog it? This sort of realistic discussion about coding is not something that’s usually taught in school!

And they all said yes, so here we are.

I doubt any of you are interested in all the details above, but there’s something to be said for sharing this whole long exchange, just to get a sense of what it takes to build this sort of software package in a way that will be useful to people.

4 thoughts on “All the little decisions we have to make when developing public-facing software”

  1. Since it’s a software post, I’ll ask a software question, first framed specifically, then generally: why not use existing publicly-available R code for converting FIPS codes to counties? What are the tradeoffs, and how do you evaluate them?
