The cleantech job market: Every modeler is supposed to be a great Python programmer.

This post is by Phil Price, not Andrew.

I’ve had a run of luck ever since I left my staff scientist position at Lawrence Berkeley Laboratory to become a freelance consultant doing statistical modeling and forecasting, mostly related to electricity consumption and prices: just as I finished a contract, another one would fall into my lap. A lot of work came my way through my de facto partner Sam, but then my friend Clay brought me into a project, and every now and then my friend Aeneas has something that he needs for his company, and a couple of clients found me by reputation alone, without any personal connection.

One lesson is: even in today’s world, with LinkedIn and websites and blogs and other ways of making ourselves known to the world, personal contacts matter a lot in getting consulting work. Or at least that has been the case for me. That’s been good for me because I’ve had good contacts, but it’s not necessarily good for society. If you’re younger and don’t have a lot of work experience, and you don’t have many friends doing the same sort of work you’re doing, you won’t have the advantages I’ve had.

So, for seven years everything was great. But this year has not gone so perfectly: I’m down to two clients at the moment, and one of them only needs a little bit of work from me each month. I’m looking for work but, never having had to do it before, I don’t really know how. But one thing I know is that people use LinkedIn to look for jobs and for people to fill those jobs, so I updated my long-moribund LinkedIn profile and clicked a few buttons to indicate that I’m looking for work. Several recruiters have contacted me about specific jobs, and I’ve also been looking through the job listings, looking for either more consulting work or for a permanent job.

Three things really stand out. Here’s the TLDR version:
1. There’s a lot of demand for time series forecasting of electricity consumption and prices.
2. The modeler has to write the production code to implement the model.
3. It’s gotta be Python.

That’s pretty much it for factual content in this post, but then I have some thoughts about why one aspect of this doesn’t make much sense to me, so read on if this general topic is of interest to you.


I. Modeling and Forecasting of Electricity Supply, Demand, and Price.

There are quite a few jobs for electricity time series modeling, and for optimization based on that modeling. Some companies want to predict regional electricity demand and/or price and use this to decide when to do things like charge electric vehicles or operate water pumps or do other things that need to be done within a fairly narrow time window but not necessarily right now. And then there are other forecasting and optimization problems like whether to buy a giant battery to use when the electricity price is high, and if so how big, and how do you decide when to use it or recharge it. All of this stuff is right up my alley: I’m good at this and I have lots of relevant experience. To give an example of a job in this space, here’s something from a job description I just looked at (for a company called Geli): “Your primary responsibility will be to lead the development of our time series forecasting models for solar and energy consumption using machine learning techniques, but you will also help develop new forecasting models as various needs arise (eg: prototyping forecasting wholesale prices for a new market).” This is extremely similar to work I have been doing off and on for one of my clients for the past eighteen months or so. Sounds great.

And there’s a bullet list for that same job listing:
* Feature engineering
* Prototyping new algorithms
* Benchmarking performance across various load profiles
* Integrating new forecasting algorithms into our production code base with robust test coverage
* Collaborate with the rest of the team to assess how forecasts can be adjusted for various economic objectives.
* Proactively identify opportunities within [our company] that can benefit from data science analysis and present those findings.
* Work collaboratively in a diverse environment. We commit to reaching better decisions by respecting opinions and working through disagreements.
* Gain in depth experience in an exciting industry as you work with storage sizing, energy financial models, energy tariffs, storage controls & monitoring.

Almost all of that bullet list sounds great to me. But not literally all of it.


II. Modelers have to be coders.

The one thing that doesn’t? “Integrating new forecasting algorithms into our production code base with robust test coverage.” It’s funny how this is just sort of stuck in there among the other items, because writing production code, and the tests for that code, is a different skill from conceiving, writing, and testing the models in the first place.

If you talk to the recruiters or look at the detailed requirements for the job, it’s explicit that they want the person who does this job to write the production code to implement the models.

III. The coding is going to be in Python

One of the requirements for that job: “Advanced Python skills, as well as familiarity with pandas and scikit-learn.”

Everyone wants Python. I have looked at a few dozen job listings that are superficially similar to this one and every one of them wants Python. I haven’t seen a single one where they’re looking for R or even C++; Python rules this roost. I think this may not be the case for other STEM areas like biotechnology, where I think R is still common, but in energy forecasting and optimization Python is really all that matters.

I think this may be related to the desired relationship between modeling and coding: R is (in my opinion) vastly better than Python for exploratory data analysis and graphics. I’m good with R and decent with Python and I find it much faster and more pleasant to do my initial analysis and simple modeling in R, to the extent that I’ll sometimes do it that way even if I ultimately need to deliver something in Python. When I was a Python newbie a few years ago I thought this was just lack of experience on my part, but it’s been clear for a while that that is not the case. Python doesn’t yet have anything remotely close to ggplot for rapidly making exploratory graphics, for example.

The fact that Python is preferred to R for production code is unsurprising. R is extremely slow at a lot of tasks, for one thing, even more than Python. For another, R’s object-oriented programming implementations seem a bit weird; I think they were not originally part of the language at all but were sort of grafted on, whereas Python’s object-orientation is more organic. R has at least four object-oriented systems you can use (S3, S4, RC, and R6) and there are cases to be made for all of them, which is maybe an indication that none of them are all that great.

IV. Discussion

The fact that all of these companies want the modeler to write production code (literally all that I have seen so far) is a problem for me because I don’t like writing production code and I’m not very good at it. I claim to be very good at modeling and not good at production coding, but you’ll have to take my word for my modeling skills…so I hope Andrew (Gelman) won’t mind if I use him as an example instead, because you have more reason to believe me when I say: Andrew is among the best in the world at conceiving of Bayesian multilevel models — certainly among the top few percent of people who regularly do such modeling — and yet I think he would agree that he’s no great shakes at coding them once they’re conceived. Well, I write code like Andrew does, which is the way the Neanderthals did it: I tend to have a procedural rather than an object-oriented way of thinking of things, I tend not to think at all about computational efficiency when I’m designing the model, and I often write code that looks kinda ugly and hard to read unless/until I go back later and fix it up. It’s a bit hard for me to judge my skills compared to the average programmer but I think that I write fair but not good Python code, and certainly not excellent Python code; if I were being graded among professional Python programmers I’d be hoping for a B- but expecting a C. If a company says that they need excellent Python skills, and they mean it, then I’m not the right person for that job.

I’m a fairly intelligent person and I’m sure I could learn to write better code if I have to, and to some degree I don’t much mind if I have to…but I am never going to _enjoy_ coding the way I enjoy the modeling part of the task. I’d much rather get something working and then hand it off to someone else who can refactor it for speed and clarity, and have it conform to the desired style conventions, etc. etc. There’s a lot of overlap in the skills required to write a good model and the skills required to write a good program, but the overlap is very far from perfect.

Because of my enjoyment of modeling but dislike of programming qua programming I may be biased in my evaluation of the situation, but that doesn’t mean I’m wrong when I say: I don’t think it makes a lot of sense to require that the modeler write the production code and the tests. Or rather, this might make sense for a really small company but I don’t think it makes sense for the companies I’m looking at. It’s sort of like putting together a football team and requiring that every player be able to play both offense and defense. It’s not like it’s totally ridiculous — if someone has the skills to be a wide receiver they can probably learn to cover the other team’s wide receivers pretty well — and certainly if you do find someone who is great at both roles then it makes sense to hire them. But as a requirement it is very limiting. You’re trying to optimize the performance of your team, and in general you’re not going to get that if you insist that every player fill multiple roles.

Fred Brooks, author of the classic book “The Mythical Man-Month”, died recently. I don’t know if anyone reads that book anymore, but for a few decades it was seen as a valuable source of insights not only into developing software systems but also into management in general. One of Brooks’s points is that in programming, as in any sphere of human endeavor, the best people are much, much better than average, even among professionals. The best basketball player on a professional team is much better even than the fourth- or fifth-best.

In the past few months tens of thousands of programmers have been laid off here in the Bay Area. Literally tens of thousands. Some of these are much better than I will ever be at programming, even if I really try to improve. Different people are talented at different things and some people are talented programmers. Others of us are talented modelers. Why not let me use my Neanderthal-level programming skills to get my model working, and then pass it along to one of these talented programmers to refactor it into something compact and readable, and write the tests for it? This either frees up my time to do more modeling, or to sit around doing nothing and not getting paid. Hire me part-time or as a consultant to write the models, and let the great programmers do the programming.

Indeed, this is exactly how things have gone with my work for one of my current clients. We started out with just me and my friend Clay trying to do everything in a software development task. We were a great team for doing the basic modeling but we were struggling with turning it into a good program so we brought in a frontend programmer and a backend programmer and put the entire codebase into their hands. Between the two of them, they refactored just about everything Clay and I had done. The hours per month that Clay and I had been putting into the project dropped way back because the other two were now doing most of the work, but that’s the way it should be…in my opinion.

Unfortunately for me, the job market does not seem to agree. It appears I will have to improve my Python programming skills, so I’m going to work on that. There are plenty of online tutorials and other resources so I guess I’ll look into some of those. If you have advice please leave it in the comments.

This post is by Phil.

70 thoughts on “The cleantech job market: Every modeler is supposed to be a great Python programmer.”

  1. > The fact that Python is preferred to R for production code is unsurprising.

    One other likely aspect is that Python has a much better error handling system than R, which can often be critical for production systems. A lot of R code will just crash if it runs into an error, while Python tends to rely more on exceptions that can be smoothly handled.
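
    For instance, in Python a production loop can log a bad batch and keep going instead of dying. A minimal sketch, assuming `model` is any object with a predict method:

    import logging

    def safe_forecast(model, batch):
        """Return a forecast, or None if this batch is malformed."""
        try:
            return model.predict(batch)
        except ValueError as err:  # e.g. wrong shape or NaNs in the input
            logging.warning("skipping batch: %s", err)
            return None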

    > If you have advice please leave it in the comments.

    Not sure about your development setup, but my main tip is to try out Visual Studio Code for Python development. Also, check out Python’s new type hint features, those are generally essential for large codebases. http://mypy-lang.org/
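
    For example, a couple of annotations are enough for mypy to catch a type mix-up before the code ever runs (a toy sketch, names made up):

    def mean_load(hourly_kw: list[float]) -> float:
        return sum(hourly_kw) / len(hourly_kw)

    mean_load(["1.0", "2.0"])  # mypy flags the list[str] argument; untyped Python would only fail at runtime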

    • Also, even for Python development, C++ skills can be extremely useful. Being able to implement critical parts of the code in C++ (and use those critical parts in Python) is extremely handy!

      • I’ve written Python in a few places, but in none of them was anyone writing any C++. A couple have been places that were largely C#/.NET with me writing a bit of Python that none of the other developers were expected to deal with, and now I’m in the new-to-me situation of writing Python while another team is writing C, which is quite the chasm between languages. But then I know C++ is a hugely popular language, and if I haven’t encountered it professionally, that presumably reflects my having gone down a canalized path, using other languages and working for organizations in sectors where certain technologies are more or less popular than others.

      • Here’s what my experience of this has been.

        I know both Python and C++, and I rarely use C++. At my current workplace, most of the programmers come from an econometrics background, and don’t know C++, so I don’t want to write code which will be impossible for them to maintain. I find that most of the time, you can accomplish something reasonably close to a C++ implementation using NumPy. That’s the tool I reach for now when Python is not fast enough.
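
        For example, a hot Python loop replaced by NumPy’s vectorized operations (a generic sketch):

        import numpy as np

        prices = np.random.default_rng(0).uniform(20, 200, size=1_000_000)

        # Pure-Python loop: interpreted, slow
        total = 0.0
        for p in prices:
            total += max(p - 100.0, 0.0)

        # NumPy: the same computation runs in compiled code, usually orders of magnitude faster
        total_np = np.maximum(prices - 100.0, 0.0).sum()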

  2. You got my attention! I am extreme – more so than probably anybody that regularly reads this blog. I do virtually no coding in any language. I’m not advocating that as a general positive trait, and I certainly don’t advocate that for my students. But I don’t think that coding (even poorly done) is necessary or sufficient for good data analysis. I recognize that a production environment does require it, but as you suggest, that need not mean that everybody that works with data must be responsible for coding it for production.

    What I find puzzling and disturbing is the pronounced role given to coding skills for data scientists, rather than to data sense-making skills. I think it is largely due to the former being much easier to measure than the latter. It may also be due to the fact that many decision-makers view data analysis as something that must be done, but not something they really take seriously – they will ignore the analysis if it doesn’t suit them (or rather insist it be redone until it shows what they want). As long as some algorithm gets put into production, they can check off a task as accomplished and nobody worries much about what the algorithm is doing. Poor decisions may result, but it is difficult to attribute the responsibility.

    So, I echo Phil’s concern though I doubt he would go as far as I do in my complaint.

  3. Wow, I’m feeling this pretty much word for word. I’m a methods-heavy epidemiologist working in industry and my job is maybe 80% programming, 20% modeling. I love the modeling part, am very slow and clunky on the programming part, but it’s not like there are loads of modeling jobs waiting for me…

    • Mitzi, that’s fantastic! I can’t believe I hadn’t heard about this before. Nobody tells me anything! WTF am I doing with matplotlib and Seaborn? Thanks very much, this will change my life. Not in a really huge way, I don’t want to go overboard here. But it will improve my life noticeably.

        • +1 for plotnine. I’ve really enjoyed using it.
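
          For anyone else who hasn’t tried it, the syntax really is ggplot carried over to Python. A minimal sketch with made-up data:

          import pandas as pd
          from plotnine import ggplot, aes, geom_point, facet_wrap

          df = pd.DataFrame({"x": list(range(10)) * 2,
                             "y": [v ** 2 for v in range(10)] + [v * 3 for v in range(10)],
                             "grp": ["a"] * 10 + ["b"] * 10})
          p = ggplot(df, aes("x", "y")) + geom_point() + facet_wrap("~grp")
          p.save("demo.png")  # or just `p` at a notebook prompt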

          In other possibly new things, I didn’t learn about vscode until 2021, but it’s great. There are at least three substantially different ways to work with Python in it (regular Python, # %% delimited python files [that act kinda like notebooks], jupyter notebooks). It’s cool there are so many different interfaces these days. I end up working in a lot of different things (jupyter notebooks, vscode, Rstudio, and PyCharm) depending on what I’m doing.

      • I haven’t heard of plotnine either but I’ll have to give it a try. In the meantime, what about Plotly? Once you start to understand it, it’s fantastic for interactive analysis.

  4. Here is a perhaps amusing story about a failure in forecasting electrical demand.

    I live in Petrolia, in a remote part of Humboldt County, CA, in the “Emerald Triangle.” A few months ago our electrical utility, Pacific Gas & Electric, fessed up that it had no more capacity to serve the southern part of the county, because electrical use by the marijuana industry for greenhouses, drying plants, etc., has maxed out PG&E’s transmission capacity.

    • One of my first consulting jobs, back in 2015, involved looking at electric vehicle charging: how much was it happening, how much was it increasing, to what extent could we forecast what it would look like over the next decade and longer, etc. The client was PG&E, and they were trying to figure out how their infrastructure would need to change in order to charge all the electric vehicles that they could see coming over the horizon. One of the big issues was: how do you know how your existing customers with electric vehicles are behaving? How often do they charge them, for how long, starting at what time, etc. We (me and my partner in the work) asked about the feasibility of getting data on who owns electric vehicles, e.g. by using California’s data on car registrations, but…well, I don’t know what the legal impediments would have been but we were told it was definitely not feasible on the timescale of our consulting project. What PG&E did know, though, was who was on an electric vehicle rate plan. If you owned an electric vehicle you could get on a special PG&E rate plan that made electricity very cheap overnight but more expensive during the mid-afternoon and early evening: they want EV owners to charge when the grid is not under stress. Not everybody with an EV is on an EV rate plan, indeed many are not; and some people get on the plan but then get rid of their EV but don’t switch back, so this isn’t perfect but it’s what we had.

      So we get the data and we’re looking at this and that, including looking at whether we can find a signature of EV charging so that we can determine which of PG&E’s customers have electric vehicles but aren’t on an EV rate. One oddity that stuck out was that in Humboldt County a disproportionate fraction of customers were on an EV rate. You can only get on an EV rate if you have an electric vehicle registered to you, and it seemed pretty amazing that Humboldt County would have so many EVs, especially seven years ago, when EV penetration was much lower than it is today (and it’s still pretty low).

      But also, when we looked at the electricity consumption pattern of those customers it didn’t look like EV users elsewhere. It’s really easy to see when someone starts charging an electric vehicle, because that immediately becomes the biggest load of the day by far. And of course you can see when that load shuts off. But some EVs haven’t been driven very far and only need to recharge for a few hours; others need to charge all night; etc. So when you look at the load in EV households you see a wide variety of durations for charging. But not in Humboldt! In Humboldt County there were lots and lots of customers whose maximum load would start the moment the price dropped, and the load would stay at the maximum all through the night and the next morning and wouldn’t turn off until the price stepped up again.

      As you have no doubt recognized by now, these people were not charging electric vehicles, they were growing pot.

  5. I’m a modeler who’s a sometime *clever* programmer but never a *great* programmer as well. Here’s my suspicion. If you have separate guys doing the programming and the modeling: (1) they can get into fights that management can’t resolve because management has no idea *how* to resolve them.
    (2) When the project goes bad, neither the programmer nor the modeller is going to take the blame.
    (3) Management wants a production model. A great model they can’t turn into a production model is just as useless as a crappy model brilliantly instantiated in a program.
    So what they really want is a modelling/production code team. But they don’t know how to pair the two halves together, or how to buy one without the other. So they insist on both, knowing that at least now, if the project goes bad, they know *exactly* who to blame.

    A final consulting anecdote. At the 500-or-so-person company I worked at for years, I did more modelling than anyone else and coded it up like you did… in code that worked, but was… bad. So I asked if I could be relieved of those duties and just set up a modelling department with no programming responsibilities. They said no, and said it was largely because of the reasons above. They’d rather have kludgy code than fights between modellers and programmers.

  6. I am a full-stack engineer and statistical modeler, and yeah this is a problem. The steel man reason for this lack of division of labor is that it’s easy for things to get lost in the abstraction. It’s pretty easy for seemingly innocuous implementation decisions to meaningfully alter the distribution of model inputs in a way that creates serious performance differences between prototyping and production. But this is a problem that can be solved by good communication and code review practices.

    The bigger reason in my opinion is that bad code is easy to notice while bad models are not that obvious, so most companies in practice only need software engineers who know a little bit of modeling. The modeling they’re looking for is the ability to find and grab covariates that look relevant, apply some basic log(1 + x) transformations, and feed them into a black box point estimator (usually xgboost), and then graph a cross validated ROC curve. In other words, a modeler in a role like this needs

    1. The quantitative literacy to understand what numbers might be relevant and how to put them on a reasonable scale
    2. The technical skills to actually pull those numbers from a datastore and write outputs to another datastore
    3. The knowledge to specify one of a family of off-the-shelf algorithms that’s appropriate

    Once this is done, if the model raises an unhandled exception during every other data pull or eats all the CPU on a source database or takes 10x too long to run, everybody notices immediately. If the output numbers exist on time and are the correct datatype, but are completely ridiculous, it’ll take a much longer time for anyone to notice, and the only ones who would notice are confined to a much smaller domain. If the output numbers seem reasonable, but have a subtle bias or are not as tight as they could be, nobody might ever notice. It might get exposed if the model is put in charge of hard financial decisions that get A/B tested or otherwise evaluated (insert issues with A/B here). But otherwise, a lot of model performance evaluation is left to the discretion of the modeler themselves.

    So I’d say you’re probably severely overqualified for what these people are looking for in terms of statistics and modeling. A lot of them aren’t expecting a custom, application-specific model; they’re expecting you to be able to write a SQL query and glue the outputs to prophet (which is, imo, a shockingly bad product for forecasting, its stan backend notwithstanding) or a recursive boosted tree. I think this data plumber + algorithm cookbook approach is fine to get some smoothing or relative ordering for reducing the complexity of some decisions. But it gives up a lot more than people think; where units are not truly i.i.d., where errors are clustered, where uncertainty quantification and calibration are important, all of which represent more real-world problems than people typically assume.

    Needless to say, I’m not totally happy with the status quo. Too many people think that because sklearn.classifier.predict_proba gives a vector constrained to the unit simplex, it must be giving a calibrated probability, and the only ones with the standing to tell stakeholders that it isn’t are the ones giving it to them in the first place. People want answers fast and confident, and it’s tempting to give them whatever comes out instead of taking the time to check if it’s correct; who’s going to check anyways?

    • As for concrete advice, the technologies to get familiar with are:

      1. numpy and pandas for number crunching
      2. pydantic for general data structure boilerplate
      3. For plotting, plotnine is a nice easy way to reuse your ggplot knowledge. Personally, I like bokeh. I’d stay away from base matplotlib at this point

      The new wave is to make your applications “data oriented” rather than object oriented a-la early 2000s. Abandon the old school literalist “nouns and verbs” metaphor where objects are things that do methods to other objects. Don’t have deep inheritance chains or objects that do a lot. Instead, think more functionally; start from the last stage of your output, what data is in your output? Design a data structure or set of composable data structures to hold that data that looks clean and readable. Now, how does that data structure get constructed from data at a previous stage? Now design the input data structures for that previous stage, and repeat; how do you get those? Work that way until you get to your top level inputs.
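
      As a toy sketch of that output-first style (hypothetical names, using the pydantic mentioned above):

      from datetime import datetime
      from pydantic import BaseModel

      # Step 1: design the final output
      class ForecastPoint(BaseModel):
          timestamp: datetime
          load_kw: float

      class SiteForecast(BaseModel):
          site_id: str
          points: list[ForecastPoint]

      # Step 2: ask what a SiteForecast is built from, and design that stage's inputs
      class MeterReading(BaseModel):
          site_id: str
          timestamp: datetime
          kwh: float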

      Once you’re past prototyping, always type hint your code and check the type hints as far as you can with mypy. It’ll catch a lot of errors before they happen, and it makes the code a lot more readable.

      As more modeling-specific advice, it can help keep things clean if you move the data munging into an sklearn “Pipeline”

      https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html

      Of course it’ll be a mess of functions during prototyping, but if you can arrange it into a sequence of definable steps and put it into a sklearn pipeline, it’ll help handle a lot of code reuse and variable saving for all your log transformations and standardization and one-hot encoding. Unfortunately, sklearn will push “pickle” on you when it comes time to save your pipeline to disk. Do this instead
      https://onnx.ai/sklearn-onnx/
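
      For concreteness, a minimal pipeline might look like this (a generic sketch with made-up data, not anyone’s production setup):

      import numpy as np
      from sklearn.linear_model import Ridge
      from sklearn.pipeline import Pipeline
      from sklearn.preprocessing import StandardScaler

      rng = np.random.default_rng(0)
      X = rng.normal(size=(200, 3))
      y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=200)

      pipe = Pipeline([
          ("scale", StandardScaler()),  # scaling constants are fit once, then reused at predict time
          ("model", Ridge()),
      ])
      pipe.fit(X, y)
      preds = pipe.predict(X)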

        • All this makes sense, but are we just telling Phil to be more of a software engineer?

          Sure, we can do the software engineering (Phil says as much), but I personally have a lot of trouble keeping it all in my head. I don’t think it’s a 1 to 1 tradeoff (learn Docker, forget a distribution or two), but for me it’s definitely a tradeoff.

          I’m also pretty convinced that the more I take on individually the more mistakes I make (for any number of reasons). Certainly I like learning new things, but it also seems silly for me to do this when there are plenty of people around me better at those things!

          I think there are different answers to these questions at the individual and organizational level (Should someone be allowed to work outside their strict role? Absolutely! Is it a long term good strategy to rely on this? I doubt it!).

    • +1 for “bad code is easy to notice while bad models are not that obvious”. A very slight twist is that functional coded products have a concreteness that good modelling lacks. It’s just extremely hard for decision makers to rationally weigh the value of good modelling against the value of a polished user product.

    • I have another comment pending moderation with python programming recommendations. As an add on to that, I highly recommend getting familiar with environment management tools. In particular, I would recommend pyenv for managing python versioning, the pyenv-virtualenv plugin for an isolated dependency environment, and poetry for project management. Unfortunately, the general ecosystem for environment management in python is a fragmented mess since it’s a relatively early interpreted language, and the good tools are not built into the project. It’s fast and easy and therefore tempting to just install the packages you need with the built in system pip, but in the long run not using a proper dependency solver in a production environment will probably cause a few headaches

      • Somebody,
        Thanks for taking the time to give many useful tips and much advice.

        One of my programmer friends has already drummed into me the need to use a virtual environment. This is especially true in Python, I think, where it seems more common than in other languages that the next version breaks stuff in previous versions (this is true of libraries and of the language itself). I suppose this is considered to be a feature, since it allows everything to keep evolving rapidly rather than the old FORTRAN approach in which the language got a major update maybe once every fifteen years or whatever it was.

        Anyway thanks.

        • One of the huge advantages on Julia is that reproducibility of package environments is built in from the start and works exceedingly well. Way way better than Python afaict. Of course you need to work with people who are willing to consider or even want Julia.

    • somebody,
      Your general take on the modeling and programming landscape seems to be consistent with what I’m seeing, both in the job listings and elsewhere (such as blogs and YouTube videos that purport to teach various aspects of modeling). It seems that to a lot of people modeling just means picking a function out of scikit-learn that ingests the input data and outputs a forecast, and that’s basically it. Do a train/test split, call a canned routine (sarimax or garch or xgboost or whatever), generate the forecast…that’s it. Why, you can learn to be a statistical modeler in ten minutes!

      Perhaps a lot of the time that is all you need. And over my career — both my consulting career and my scientific research career — sometimes it really has been that simple. Just fit a random forest or a linear regression model or a SARIMAX model and you’re done. But although I have sometimes had that experience, it has been far more common that getting a good model has required going beyond the canned models. As you say, “I think this data plumber + algorithm cookbook approach is fine to get some smoothing or relative ordering for reducing the complexity of some decisions. But it gives up a lot more than people think; where units are not truly i.i.d., where errors are clustered, where uncertainty quantification and calibration are important, all of which represent more real-world problems than people typically assume.” Yes, exactly. But if people can’t tell the difference between a bad model, a fair model, and a good model — or if they don’t care — then there’s no need to try to build a good one. Maybe I have been underestimating the extent to which this is the state of the world.

      • I don’t think it’s all you need. A bad model that looks reasonable can be actually worse than no model at all! These problems cost organizations real money. But more often than not, if the modeler doesn’t point it out themselves, no one else will, either because they don’t notice or because they lack the mathematical clout to be listened to.

      • Because not everyone would necessarily understand the point of my short comment, I’ll expand because I think it’s an interesting part of the conversation.

        In the classic paper “A market for lemons” (Akerlof 1970) https://en.wikipedia.org/wiki/The_Market_for_Lemons Akerlof discusses precisely this case, where asymmetry of information and the inability to distinguish high quality from low quality (in used car markets, hence “lemons” here means bad cars) cause the market for some good to converge towards paying bottom-dollar for the lowest quality product. This is exactly the situation described by Phil’s words: “But if people can’t tell the difference between a bad model, a fair model, and a good model — or if they don’t care — then there’s no need to try to build a good one.”

        In fact, it’s not just that there’s no need, it’s that it becomes economically infeasible to do a good job. Doing a good job, providing high quality, takes time and effort. A considerable amount of it really. Phil for example got where he is by doing a PhD in Physics and maybe 20 years or something studying building energy usage for the DOE and learning to use Stan and discussing modeling on the blog here and such. Those are sunk costs, but if you don’t pay them (or something like them) up front you really can’t do a high quality job of modeling later. You have to understand how modeling works, and to get there you have to have tried modeling a lot of things. Plus, when actually building a given model you need to explore the best methods of describing the problem, try out different combinations, etc and those are not sunk costs, they are costs incurred with each new project.

        Plugging and chugging with scikit learn or whatever, is not that. The software development sunk costs are paid by someone else (the software developers) and the current costs of just running the model are basically straightforward computing cycles issues that pretty much everyone understands.

        So, here’s the situation. One person could build a model which is **orders of magnitude better** at predicting the outcomes of possible “outlier events” such as hurricanes and floods and such. But because they only happen every so often, it could take a whole lifetime or even 1000 years to “show” that the model is better from data. Instead you could potentially show that the model is better by discussing the structure of the model and why it does a good job, but the consumers are ignorant enough that it doesn’t matter, they can’t distinguish between your babble and someone’s babble about stochastic gradient descent and boosted trees or whatever.

        Just like in the “lemons” example, where the consumers are completely unsure of the quality of the cars, they are unwilling to pay substantially more than what it costs to push the scikit learn button, no matter what the quality of the product because the quality of the product is something hidden behind a screen they can’t discern.

        Because they’re not willing to pay for considerable investment, the people who provide the very high quality models remove themselves from the market, or begin providing the low quality thing anyway even if they know how to provide the high quality thing. The remaining market is exclusively for the low quality thing, except in very bespoke situations.

        As far as I can tell, this is for the most part where we are. The distinguishing factor for the buyers is **how well / reliably does the software spit out the numbers we need**, not *do the numbers make sense*.

        • Very nice description of the lemons problem. Depressing, though. It made me think of teaching: identifying excellent teaching (or even better teaching) could take a whole lifetime. The market rewards low quality teaching better, relative to its cost, than high quality teaching. As they say on Marginal Revolution, “Solve for the equilibrium.”

          Of course, as with all asymmetric information problems, there are myriad ways to reduce the problem (monitoring, credentialing, evaluation systems, etc.) – but all involve costly resources so all are somewhat imperfect. Inevitably, the asymmetric information problem cannot be “solved,” only reduced by some “optimal” amount.

        • I love this from the Wikipedia article: “Both the American Economic Review and the Review of Economic Studies rejected the paper for “triviality”, while the reviewers for Journal of Political Economy rejected it as incorrect, arguing that, if this paper were correct, then no goods could be traded. Only on the fourth attempt did the paper get published in Quarterly Journal of Economics. Today, the paper is one of the most-cited papers in modern economic theory and most downloaded economic journal paper of all time in RePEC (more than 39,275 citations in academic papers as of February 2022). It has profoundly influenced virtually every field of economics, from industrial organisation and public finance to macroeconomics and contract theory.”

        • Akerlof, the author of The Market for Lemons, also recounted his efforts at publication (though I can’t remember where I read this). I recall him saying that the original paper (the one that kept getting rejected) was “better” than the one ultimately published. He had to introduce additional mathematics (which I believe he said was unnecessary and only made the paper harder to read and understand) in order to get it accepted.

  7. Team coder here. I know the academic crowd here on Andrew’s blog is maybe not in agreement with this (and I think [somebodys](https://statmodeling.stat.columbia.edu/2022/12/08/the-cleantech-job-market-every-modeler-is-supposed-to-be-a-great-python-programmer/#comment-2141341) comment is spot on as well).

    1. Backtesting a model on historical data is not sufficient to tell if the model is good in production. It is common for people fitting models to misunderstand aspects of how the model will be put into practice that either make it totally infeasible (e.g. using information not available at runtime, misunderstanding computational constraints) or biased (e.g. historical labeled data is fundamentally different than the realtime data the model is being extrapolated to). So sorry, you may *think* your model is good, but you don’t know whether it is until it’s actually in production and you have real-life results.

    2. Very little of the development time is spent fitting a model! To oversimplify, I break it down into A) work with business people to translate their needs into something mathematically tangible, B) fit a mathematical model, C) write code to automate the process of giving the model predictions/forecasts to the right groups. A and C are easily the hardest/longest parts. I presume Phil is great at A as well. If you do A well, all B entails is:

    import library
    model = library.model(your_model, your_data)
    library.save(model, 'model_artifact.pkl')

    B is close to trivial (and getting closer all the time with easy tools to create auto-models). (To be fair, B also involves things like showing the model is accurate on historical data, but that still pales in time spent next to either A or C.) IMO we really need people in more middle-manager-like roles who do A much better; B is just such a small part.

    Final point: I am quite skeptical of the idea that Python is more popular due to speed or syntax ease. I don’t see people using classes very often, and Python is not fast in an absolute sense (for actually *fitting* models I am pretty sure R is faster in many scenarios). My hot take is that environment management in Python has been around longer, e.g. `pyenv` and variants are old, whereas managers like `renv` are much newer.

    There is a glut of people getting masters degrees in data science who can copy-paste code to do B and produce charts of accuracy given a nice dataset just fine and dandy. They can’t per se do A well, and C is almost a totally different skill set (as Phil attests to here).

  8. I teach Python and Data Science to large corporations who “need to do things in python”. I’m obviously biased, because I make my living from it, but it is a good, safe choice these days.

    There are thousands of packages for Python, it is taught in universities, and it is the de facto language of data science.

    Sure, it has drawbacks like anything else, but if I’m advising someone about where they should put their focus for learning, as a data person it makes sense to learn what will be the best investment. It is easy to make that case for Python.

  9. > [ because I don’t like writing production code and I’m not very good at it. ]

    #ModelOps #MLops; “BentoML” (FastAPI (Authors of DRF))

    E.g. Pytest has a @pytest.mark.parametrize decorator for testing units of functionality, and integration tests make for fearless refactoring. There’s a new Julia Evans zine on debugging that strongly encourages preemptively including test assertions, too
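
    A minimal example of that decorator, with a toy function and cases:

    import pytest

    def clip_negative(x):
        return max(x, 0.0)

    @pytest.mark.parametrize("x,expected", [(-1.0, 0.0), (0.0, 0.0), (2.5, 2.5)])
    def test_clip_negative(x, expected):
        assert clip_negative(x) == expected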

    > [Python or R]

    You can install condaforge with “MambaForge”.
    Conda-forge builds Julia, Python, R, Rust packages. Rust has Polars, which is fast and has a Python API and does SIMD. Apache Arrow does fast structs with schema between languages including Python, R, JS, WASM. Polars and Apache Ballista are built with Arrow and SIMD.

    > 2. Very little of the development time is spent fitting a model! To oversimplify, I break it down into A) work with business people to translate their needs into something mathematically tangible, B) fit a mathematical model, C) write code to automate the process of giving the model predictions/forecasts to the right

    A) Constraints, check figures, tests

    B) AutoML has an unbiased background for its null hypothesis formation and model evaluation sequence.

    “Modeling and Simulation in Python” could be a (free, and also purchasable) Jupyter-Book of RST, Markdown, and Notebooks
    https://greenteapress.com/wp/modsimpy/

    C) DVC.org does branch metrics. Ml-hub can spawn per-user ml-workspace containers with conda. Kaggle/docker-python also has conda in their container.

    ContainDS, BinderHub / JupyterHub / awesome-jupyter\*

    https://cdsdashboards.readthedocs.io/en/stable/ :

    > Run a private on-premise or cloud-based JupyterHub with extensions to instantly publish Jupyter notebooks (Voilà), Streamlit, Plotly Dash, Bokeh / Panel, and R Shiny apps as user-friendly interactive dashboards to share with non-technical colleagues.

    • Meant to mention; also:

      [sympy.utilities.lambdify.lambdify] provides convenient functions to transform #SymPy expressions to
      lambda functions which can be used to calculate numerical values very fast with any of a number of modules: math, mpmath, NumPy, SciPy, CuPy, JAX, TensorFlow, SymPy, NumExpr

      Source: https://github.com/sympy/sympy/blob/master/sympy/utilities/lambdify.py

      Docs: https://docs.sympy.org/latest/modules/utilities/lambdify.html
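
      A small example of standard lambdify usage:

      import numpy as np
      import sympy as sp

      x = sp.symbols("x")
      f = sp.lambdify(x, sp.sin(x) ** 2 + x, "numpy")  # compile the expression to a NumPy-backed function
      f(np.linspace(0, np.pi, 5))                      # evaluates on whole arrays at once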

    • Wes,
      Thanks for all this. It’s sort of interesting but also a good reminder of why I don’t want to have to become “a programmer” and would rather remain “a modeler who writes models in code.” Every few years there’s a new set of stuff to learn and a new bunch of jargon that goes with it. Use brew; no, use pip! use pipenv; no, use venv! Write a restful api, use django. MambaForge, Rust, Polars, Ballista, Arrow, SIMD, DVC, ContainDS, BinderHub… no doubt I could learn any of these to some extent — or at least learn what they mean! — but this is not how I want to spend my time, I just get no pleasure from it.

      I greatly appreciate that you took the time to write this, though. I’ve gone through it a couple of times and will pick and choose what to take from it. So thanks.

      Phil

  10. The bad news here is that there’s a big difference between being an academic and being a programmer. My understanding of the current programming world (which could be wrong, I’ve only done small-scale programming) is that your generic programmer with some number of years of experience spent those years actually writing code, and thus has actually written a lot of code. And that that’s a very different thing from people like myself (MS comp. sci.) who never wrote a big program.

    As a confirmed academic, it’s my deeply felt belief that academics ought to be able to compete with such real-life programmers, if they put in some time and thought. It may require more time and thought than you are interested in putting in, but you (Phil) apparently already have a sense of what “good” and “bad” code is, so your problem is to learn to write “good” code from the start. The problem is how to leverage your academic chops to do that in a small fraction of the time it took your generic non-academic programmer.

    I’ve been occasionally reading Derek Jones’ blog, and I think you might find it worth your time to read what he has to say about programming, both academic and practical.

    https://shape-of-code.com/

    • “As a confirmed academic, it’s my deeply felt belief that academics ought to be able to compete with such real-life programmers, if they put in some time and thought.”

      David, I’m a bit skeptical about this. Of course, there are diminishing returns on experience, etc. but in general, I don’t see why an academic is going to get as good as a programmer with five years of programming experience without themselves getting five years of programming experience.

      • Yes, it’s not easy. My thought is that by understanding the field and the problems that need to be solved (in particular, by specializing in a particular class of problems that you know well), you should be able to more quickly acquire an insight into what’s needed in applications in your area.

        The question then is how do you acquire the basic programming approaches and idioms that are going to be needed.

        That is, the object is not to compete with good programmers in general, but to be a better statistical modelling programmer.

    • Like most skills — maybe all skills — one improves with practice. Some people are quicker studies than others and I could believe you can find relatively inexperienced programmers (from academia or elsewhere) who are better than the average experienced programmer. But I think that in general experience is going to win out.

      That said, I’m not sure how relevant this question is to me specifically. Although most of my work at Lawrence Berkeley National Laboratory could fairly be described as ‘academic’, I still had to instantiate all of my models in code, and debug them and test them and run them. And for the past seven years I’ve been doing consulting. I think I’ve worked for nine different companies at this point, and for a few of them I’ve done more than one project. None of these have been ‘academic’ types of efforts, although some of the work and results could be published in refereed journals.

      Two things that have been true of my consulting work that were rarely true of my academic work are (1) several of my projects have involved working with other people who touch at least some of my code (and vice versa), and (2) in a couple of my projects the goal has been to write software that will be used by people other than those of us who are writing it, and that will be maintained in the future by other programmers.

      Item (1) has been important in two ways. First, I’ve gotten to see how ‘real’ programmers write their code, so I can try to emulate that. Nothing was ever stopping me from looking up examples of good code, but in practice I didn’t do it. But when I’m working on a project with someone else and am immersed in our joint codebase I do see good code and I see the differences between theirs and mine. This is helping me. Second, I’ve sometimes gotten direct feedback that a way I have done something is confusing or inefficient, which of course has also helped me learn.

      Item (2) has forced me to think more about exceptions and corner cases and error handling and all sorts of things I’m not used to worrying about. If I’m writing a model in order to analyze a specific dataset, I just need to develop the code until it works satisfactorily on those data. If the dataset is complete — no missing values of any of the variables — then I don’t need to worry about how to handle missing data. If I think that in the future my client might want me to work on a dataset that has some issues with missing data, I can wait and see if that actually happens. But if I’m working on a program that the client can use then I need to think about missing data issues from the start. And many other issues too. Maybe my development dataset has times recorded in UTC, but in the future they might use local time. That means some days will have only 23 hours and others will have 25…will the code still run and will the model still make sense, especially near the time changes? Etc. etc.
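
      For example, a quick pandas check of a US spring-forward day (an illustrative date, not from any client’s data):

      import pandas as pd

      idx = pd.date_range("2022-03-12", "2022-03-15", freq="h", tz="America/Los_Angeles")
      print(pd.Series(1, index=idx).resample("D").count())  # 2022-03-13 has only 23 hourly rows

      Any code that silently assumes 24 rows per day breaks right there.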

      Sorry, I’m rambling. David, thanks for your comments and for the link to shape-of-code, I will check it out. Josh, thanks for your comment too, which I pretty much agree with.

  11. “The modeler has to write the production code” has been common in some sectors over the years but not others, e.g. traditionally quants in investment banking wrote production code, but risk analysts in retail banking did not. My impression is that it’s grown in popularity, helped by Python and R being viable for both research and production use, and by essentially all recent STEM graduates knowing Python. This is different from “the modeler has to integrate their code into production”, which is much less common than having modelers write code within a framework set up by someone else; often modelers essentially only have to write some functions and not think about the packaging, deployment, workflows, etc. That’s not much more burdensome than writing model development code in a non-production language/environment.

    On the plus side it helps avoid a host of failure modes, e.g. (1) modelers do not use version control (2) model development / monitoring code is not reproducible (3) model algorithm or implementation is grossly (e.g. 100x) inefficient (4) modelers “throw over the wall” incomplete/incorrect specifications (5) unnecessary work of syncing two code bases (6) finger-pointing when something fails (7) mismatch between bandwidth of different teams (8) model grows stale while waiting to be productionized. Essentially you’re having one team be responsible for more of the end-to-end model lifecycle and getting benefits from that.

  12. > I tend to have a procedural rather than an object-oriented way of thinking of things,

    Functional programming is the way to go. Look up pure functions and give them a try.
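
    A toy illustration of the difference:

    import pandas as pd

    # Impure: mutates the caller's DataFrame as a side effect
    def add_forecast(df, preds):
        df["forecast"] = preds
        return df

    # Pure: returns a new DataFrame; same inputs always give the same output, nothing else is touched
    def with_forecast(df: pd.DataFrame, preds) -> pd.DataFrame:
        out = df.copy()
        out["forecast"] = preds
        return out

    df = pd.DataFrame({"load": [1.0, 2.0]})
    df2 = with_forecast(df, [1.1, 2.2])  # df itself is unchanged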

  13. Nice post. A few thoughts:

    (1) I believe the high need for knowing python well is a result of the current state of tooling, and most people don’t question that fact.

    (2) There are efforts to change (1), e.g. https://github.com/stitchfix/hamilton [disclosure: I’m one of the creators of it] is an example of trying to change ones working paradigm to reduce the software engineering bar to get maintainable production grade code written. In fact, its origin comes from helping a team tame their time-series feature engineering code base… (we’d love feedback ;) ).

    (3) In terms of building models and producing value for the business, I believe a self-service model is much more efficient and effective than a hand-off one. The person with the domain expertise behind the model should ideally be making the trade-offs on how it operates in production; hand-off can result in “better engineered code” but that comes at the cost of iteration velocity and increased cost to operate.

    (4) That said, to do (3) well, you generally need (1), and if I weren’t part of a platform team trying to help, there would be no (2). So if you can find a company with a platform team, or one willing to pay for vendors, it could greatly reduce the requirements to follow through on (1). So I would say go for (3) if you can; you can usually command a higher wage, and change your title to have “engineer” in it as you get better at it, rather than purely sticking to modeling things…

    • Stefan,
      What do you think of something that is neither self-service nor hand-off? On one of my current contracts I am working with a couple of programmers. I wrote a couple of time series models and wrote simple wrapper functions for them. You get a price forecast by calling the price model and a load forecast by calling the load model. One of the programmers incorporated these into the rest of the system and the other programmer wrote some test cases that can run automatically. Later I changed the load model but kept the function call the same so it just drops in where the old one used to be. I paired up briefly with one of the programmers so he could do some refactoring for performance (i.e. speed). I think this worked really well for us and it’s the work approach I’d like to use in the future when my statistical models are going to go into production code. I don’t think this could really be described as fully “self-service” or as fully “hand-off”. Is there a reason this isn’t a standard approach? Could it be?
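
      In code terms, the contract was just a stable signature. A hypothetical sketch, with a naive persistence model standing in for the real one:

      import numpy as np
      import pandas as pd

      def forecast_load(history: pd.Series, horizon_hours: int) -> pd.Series:
          """Hourly load forecast: swap the internals freely, keep the signature fixed."""
          last_day = history.iloc[-24:].to_numpy()  # placeholder model: repeat yesterday
          values = np.tile(last_day, -(-horizon_hours // 24))[:horizon_hours]
          index = pd.date_range(history.index[-1] + pd.Timedelta(hours=1),
                                periods=horizon_hours, freq="h")
          return pd.Series(values, index=index)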

      • Sorry, I missed your response…

        I think that way works better than hand-off. However, it runs up against scale at some point: either your time or the time of the people you’re partnering with, likely the latter…

        In terms of how you did it, you created an abstraction (which is what a platform essentially is), that both of you understood and could operate by. If you have to do that same cycle a few more times, my guess is that you’d extend things in a way such that you could probably do it all yourself? Otherwise that’s essentially my playbook, when I was building platforms. What you describe is what we did the first couple of times around, and then kept adding to so that the engineer wouldn’t need to be bothered when things changed, while the model owner could do it end to end themselves :).

  14. Oftentimes there’s a pair of roles putting models in production: a data scientist, who does most of the modelling, discussions with the client, etc., and an ML engineer, who builds the infrastructure around the model and helps the data scientist get it into production.
    I think this is a bit closer to what you’re envisioning; restating it in these terms might help your discussions!

    • This is definitely the approach I would like to take. From my job search thus far (which admittedly only goes back about two weeks) I haven’t seen any jobs described this way — none say “you’ll write models and work with a programmer to put them into production” — but perhaps you’re right that if I learn the right lingo I can make the pitch for this way of working.

  15. Top resource to read once you are quite comfortable in Python is Effective Python by Brett Slatkin. Extremely useful for levelling up your Python development skills.

    Also I would love to learn more about energy forecasting so if you would like to pair up any time to do some knowledge exchange, please do reply to this comment. I’ve been working in Python for a long time and am currently in a team that writes both research and production code in a shared codebase with engineers. I don’t think we’ve totally nailed the process yet, but we’ve definitely got to a point where we can both experiment efficiently and deploy models rapidly.

    • Richard,
      I’m open to the idea of collaborating on something, although I also know other programmers who I’ve worked with in the past and to be honest I’d be more likely to try to bring on one of them if the need arises; indeed that’s what I’ve done on a current project, as described in the original post. But in any case, first we need a client! Definitely let me know if you find one!

  16. Hi Phil (or any other qualified coders out there),

    my name’s Tim. I’ve been coding Python for over 8 years and I’ve been programming for around 20. I’d love to get into this space and I think that a collaboration between

    – me: strong coder/devops

    – and you: qualified energy modeler

    could make a powerful freelancing team. If you’d like to join forces with me you can find me at [email protected] or book a meeting with me at https://cal.com/timhobbs

  17. > I write fair but not good Python code, and certainly not excellent Python code; if I were being graded among professional Python programmers I’d be hoping for a B- but expecting a C. If a company says that they need excellent Python skills, and they mean it, then I’m not the right person for that job.

    Asking for skills is easy; asking for excellent skills is just as easy.

    I bet your Python coding skills meet the expectations of 50% or more of those job postings (maybe not what they're hoping for on that dimension alone, but more than enough for them to be excited about your candidacy overall).

    • Mmmmmaybe. I think this depends on whether they want a programmer who can do some modeling or a modeler who can do some programming. See Wes Turner’s comment upthread. If you are expecting a programmer who is familiar with all of that stuff, you’re not going to be happy with me. Nor will I enjoy the job if I have to become fluent in those frameworks.

  18. One of the things missing in this discussion is the idea of contracts/interfaces that allow the modelers to interact with “production” in a reasonable manner without becoming “production engineers”.

    The best way to get modelers involved in production is to give them well-defined interfaces that they can work with. For example:

    Daily at 1PM PST we will write the file “bigstuff_DATE.csv” to the directory /data/inputs/ and then call the script “/opt/bin/foo” whose job it will be to write “predictions_DATE.csv” and “qualitycontrol_DATE.csv” and “optimaldecisions_DATE.csv” to the directory /data/outputs/ within 90 minutes of being called. Once all the outputs are generated the script will touch a file “completed_DATE”. The outputs will contain….

    I mean, for a batch process, for example, that'd be a good plan. Then the “production” guys can work with well-defined things they need to get from the modelers, and the modelers can work with well-defined things they need to get from the production guys. The production guys can ensure there's a process that runs continuously, has all kinds of error checking, generates pretty visualizations, whatever… without the modeler guys having to understand all the intricacies of the entire production machinery.

    Is that sort of thing not common?
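
    To make that concrete, here's a rough sketch of what a “/opt/bin/foo” could look like under the contract above (Python, since that's what everyone wants; the model is just a placeholder, and any name not taken from the example is invented):

      import csv
      import datetime as dt
      from pathlib import Path

      IN_DIR = Path("/data/inputs")
      OUT_DIR = Path("/data/outputs")

      def forecast(rows):
          # Placeholder: the real models would go here.
          preds = [{"row": i, "prediction": 0.0} for i in range(len(rows))]
          qc = [{"n_input_rows": len(rows)}]
          decisions = [{"row": i, "action": "hold"} for i in range(len(rows))]
          return preds, qc, decisions

      def write_csv(path, rows):
          with open(path, "w", newline="") as f:
              w = csv.DictWriter(f, fieldnames=list(rows[0]))
              w.writeheader()
              w.writerows(rows)

      def main():
          date = dt.date.today().isoformat()  # whatever DATE format the contract fixes
          with open(IN_DIR / f"bigstuff_{date}.csv") as f:
              rows = list(csv.DictReader(f))
          preds, qc, decisions = forecast(rows)
          write_csv(OUT_DIR / f"predictions_{date}.csv", preds)
          write_csv(OUT_DIR / f"qualitycontrol_{date}.csv", qc)
          write_csv(OUT_DIR / f"optimaldecisions_{date}.csv", decisions)
          (OUT_DIR / f"completed_{date}").touch()  # signal completion per the contract

      if __name__ == "__main__":
          main()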

  19. On object-orientation in R & Python: John Chambers’ S, the ancestor of R, started in the mid-1970s, 15 years before Python and well before the 1981 BYTE issue on Smalltalk* hugely increased interest in object-oriented approaches (although C++ was more inspired by Simula, of course).

    I recently got John’s oral history of S & R for the Computer History Museum:
    transcript: https://www.computerhistory.org/collections/catalog/102792762
    video: https://www.computerhistory.org/collections/catalog/102792763

    *That’s https://archive.org/details/byte-magazine-1981-08 which was remarkable because BYTE August issues usually focused on languages that were widely available, which Smalltalk wasn’t. People were doing ports, but for reasonable performance one really wanted a Xerox Alto or, better, a Dorado at that point. As amusing history, in the 1990s a key Smalltalk person, Adele Goldberg, was doing a project written in Python, so she got me to review it (for general software engineering), along with Python author Guido van Rossum.

    • Note that S3 (1988) is when they started to give the name “object” to the data structures and to the functions (which previously were macros and not defined as S objects) and introduced attributes, method dispatching, etc.

      “A first hint at the object-oriented features of S came about with the 1988 release. The function print was made to recognize an attribute named class; if this attribute was present with a value, say, xxx and if there was a function named print.xxx, that function would be called to handle the printing.”

      A Brief History of S (Richard A. Becker) https://ungil.com/94.11.pdf

      “Described initially in Chambers (1987) as a language separate from S, this research later merged with other changes to form the next version, labeled S3 and described in the “blue book,” Becker, Chambers and Wilks (1988). The slogans in Section 2.3 were basic to this version of S: everything is an object (stated explicitly) and function calls do all the computation (implicit).”

      Object-Oriented Programming, Functional Programming and R (John M. Chambers) https://arxiv.org/pdf/1409.3531.pdf

      • Good references. Becker’s history brings back old memories and many familiar names; he mentioned the various books named by colors… John brought those along for the oral history.

  20. It could be much worse than this. I am currently working on a project where I started as a data scientist working with Python (extracting various patterns from bio-medical data) and ended up as a Scala/Java software engineer. Initially I was just drafting algorithms in Python that SW engineers had to turn into a working product, but there were so many problems that at some point I started writing the production code myself. Once the managers figured out I could do this, they immediately turned me into an SW engineer. I protested, of course, but they offered much more money and I succumbed.

    • My friend Shawn is like this, a bit, at least on one project. He’s the frontend programmer I mentioned above, who has taken over a lot of the stuff Clay and I were doing. He’s a good modeler himself and that’s what he prefers doing, but he is also a good programmer, certainly better than me or Clay. So we gave him the task of programming the front-end, and I kept the modeling mostly to myself. Not entirely — Clay has noted some issues and he and I have often discussed possible model improvements, which I then try out — but basically he is doing the stuff I don’t like, and he’s not getting to do his favorite stuff. His preferences are not so strong that he’s hating the way it worked out, but it’s not ideal from his perspective.

      For myself, I do find that I am slightly reluctant to improve too much at programming because what if that means I have to do more of it? No doubt I would dislike it less if I got better at it, but I think I know myself well enough to know that I’m never going to really enjoy it. Take a look at Wes Turner’s comments above… spending my days on that sort of stuff would make me very unhappy. Of course, nobody could make me do it. Or you. You could have refused, either turned down their ‘promotion’ to highly paid software engineer or gone to another company and remained a data scientist. But I could imagine sliding down the slippery slope, or whatever is the right metaphor: just finding myself doing more and more programming until that’s how I spend most of my time, without ever quite having something trigger me to quit and look for a different job.

  21. Python discrimination is real. Many employers outright demand Python-only applicants. I believe that in this way they also miss out on a lot of talent.

    Many things are elegant in R that are quite awful in Python.

    Python itself gets its speed from numpy, numba, jax, pytorch, and a number of hacks that make Python tolerable.
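
    To illustrate that point with a toy example (exact numbers vary by machine, but the vectorized version is typically one to two orders of magnitude faster):

      import time
      import numpy as np

      x = list(range(1_000_000))
      xa = np.arange(1_000_000, dtype=np.int64)

      t0 = time.perf_counter()
      s1 = sum(v * v for v in x)   # pure-Python loop
      t1 = time.perf_counter()
      s2 = int((xa * xa).sum())    # same computation in numpy's compiled internals
      t2 = time.perf_counter()

      assert s1 == s2
      print(f"pure Python: {t1 - t0:.4f}s   numpy: {t2 - t1:.4f}s")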

    Another example: it is trivially easy to bundle your R code into an R package to let others tinker with your work. A Python wheel is a different story.

    Employers should be more flexible about the language du jour and rely on data as the interchange between systems.

    My guess is that employers who demand Python-only solutions do this to their long-term detriment and, perhaps, peril.

  22. Well, this happens only because for every company that uses R you will find at least a thousand companies that use Python. The ratio of R users to Python users is also similar.

    Almost all large companies use Python. For example, Google and Red Hat use Python wherever possible; Google is the one that created TensorFlow.
    R has no such corporate backing.

    • 1000x is clearly hyperbole. Even 100x would be clearly hyperbole. 10x seems possible though.

      Here are a couple of “R vs Python” articles:
      https://towardsdatascience.com/python-vs-r-for-data-science-cf2699dfff4b

      https://www.ibm.com/cloud/blog/python-vs-r

      I know R is fairly heavily used in bioinformatics, I think largely because of bioconductor https://www.bioconductor.org

      Here’s an article that says that based on stackoverflow, R is still very popular (although not nearly as popular as Python!) https://towardsdatascience.com/python-vs-r-for-data-science-6a83e4541000

      Anyway, as I said it is unsurprising that everyone in my general area of work is doing their production code in Python rather than R. And although Andrew is right that it’s not hard for a Python person to translate from R to Python — well, at least it’s not _usually_ hard; it would be hard if you used an R package that doesn’t have a Python equivalent — it seems pretty reasonable to me that companies would want their models created in Python so they can more easily translate them into Python production code. The only thing that seems odd to me is that they want the person writing the models to be the same person writing the production code. These people wouldn’t hire Andrew Gelman to do their statistical modeling! Seems crazy.
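
      As a small illustration of how mechanical the translation usually is: statsmodels deliberately supports R-style formulas, so an R call like lm(y ~ x, data = df) carries over almost verbatim (made-up toy data, just to show the shape):

        import pandas as pd
        import statsmodels.formula.api as smf

        # R: fit <- lm(y ~ x, data = df)
        df = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [2.1, 3.9, 6.2, 8.1, 9.8]})
        fit = smf.ols("y ~ x", data=df).fit()
        print(fit.params)  # intercept and slope, like coef(fit) in R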

  23. Well, I know one case in which R comes out a little better.

    A friend of mine who is a senior person at a large economics consulting firm told me that the firm used to have a rule against using open-source code. Apparently, their computer security and legal types had urged this rule.

    However, they discovered that many of their recent PhD analysts were using R. Sometimes they would submit results straight from R. Other times, after finding a useful result, they would redo the analysis using an authorized tool. After this discovery, the firm gave up the rule and authorized R.

    Bob76
    PS. I know Python is also open source. The authorized tools were products like SPSS, SAS, etc.

  24. The good news is that these languages aren’t so different from each other. I don’t know Python (sorry! I’m not proud of this; I’ve just never gotten around to learning it), and when I work with colleagues who program in Python, I just write my code in R and they can translate as needed. That’s just fine, because I shouldn’t be writing production code in R or Python or Fortran or anything else. So my code just needs to be clear enough for my collaborators to understand what it’s doing.
