If You Want Better Code, Do It For Me (with Jonathan Schneider)

Developer Voices
1.7k plays · 10 months ago

A lot of programming is split into the mechanical work of writing what you know, and the creative work of figuring out what you don’t know. Wouldn’t it be nice to automate the mechanical stuff away?

Well the good news is we’re already automating a lot of it. Every time you run a refactoring tool or a pretty-printer, you’re handing boring work off to the computer. But how does that magic work, and how can we do more of it?

This week we’re joined by one of the authors of OpenRewrite—Jonathan Schneider—to learn how automated code-rewriting tools really work. From the basic approach to the hairy corner cases, to the reality of keeping developers happy with the subjective side of the results.

It takes a lot of work to automate work away - this week we’ll learn how the work gets done for us too.

OpenRewrite: https://docs.openrewrite.org/

Supported Languages: https://docs.openrewrite.org/recipes

Moderne: https://www.moderne.io/

Gradle Lint: https://github.com/nebula-plugins/gradle-lint-plugin

Chicory (Native JVM WASM): https://github.com/dylibso/chicory

Call Java from Haskell: https://github.com/tweag/inline-java#readme

Call Haskell from Java: https://github.com/nh2/call-haskell-from-anything

Kris on Mastodon: http://mastodon.social/@krisajenkins

Kris on LinkedIn: https://www.linkedin.com/in/krisjenkins/

Kris on Twitter: https://twitter.com/krisajenkins

#podcast #software #programming #softwareengineering #refactoring #parsers

Transcript

Key Tools in Software Development

00:00:00
Speaker
If you're writing software there's a very good chance that you're in the habit of using a refactoring tool or a linter or a code formatter or maybe even a code searcher that's more advanced than just grepping. There's actually a fair chance you've used all four of those tools very recently because they're all fundamentally useful tools no matter what language you use.
00:00:23
Speaker
They're also fundamentally operating in a similar way, so you might ask how. Conceptually it's not that hard. I think of it as a two-phase process. What you do is you write a few regular expressions, and then you write a few more regular expressions, and then their weight and complexity drives you completely mad and you spend the rest of your life in an asylum, gibbering.
00:00:48
Speaker
And then phase two is your children grow up and they realize that was the wrong way to do it and they do it the right way by writing a proper code parser.
00:00:56
Speaker
And the parser takes the text file and turns it into a logical representation of the structure of the code, which gets called an abstract syntax tree. And then you look at that tree, and either you rewrite the tree for refactoring, you look through it for searching, you critique it for linting, or you just write it back out neatly for pretty printing. That's the basic idea.
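The four uses of a syntax tree listed above can be sketched with Python's standard `ast` module. This is only an illustration of the general idea, not how OpenRewrite works; the sample function, the lint rule, and the rename are all made up for the example.

```python
import ast

SOURCE = """
def total(prices):
    # sum the prices by hand
    result = 0
    for p in prices:
        result = result + p
    return result
"""

tree = ast.parse(SOURCE)  # text -> abstract syntax tree

# Searching: look through the tree for every function definition.
function_names = [n.name for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)]

# Linting: critique the tree, here flagging `x = x + y`.
def lint(tree):
    warnings = []
    for node in ast.walk(tree):
        if (isinstance(node, ast.Assign)
                and len(node.targets) == 1
                and isinstance(node.targets[0], ast.Name)
                and isinstance(node.value, ast.BinOp)
                and isinstance(node.value.left, ast.Name)
                and node.value.left.id == node.targets[0].id):
            warnings.append(f"line {node.lineno}: consider an augmented assignment")
    return warnings

# Refactoring: rewrite the tree, here renaming the variable `result` to `acc`.
class Rename(ast.NodeTransformer):
    def visit_Name(self, node):
        if node.id == "result":
            return ast.copy_location(ast.Name(id="acc", ctx=node.ctx), node)
        return node

renamed = Rename().visit(ast.parse(SOURCE))

# Pretty printing: write the tree back out as text.
printed = ast.unparse(renamed)

print(function_names)
print(lint(tree))
print(printed)
```

Note that the comment in the sample function does not survive `ast.unparse`: an ordinary abstract syntax tree throws comments and original spacing away.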
00:01:21
Speaker
Now we dig in properly because there are a lot of juicy details once you pop the lid off it.

Introducing Jonathan Schneider and OpenRewrite

00:01:27
Speaker
I'm joined today by Jonathan Schneider, and he's one of the authors of OpenRewrite, which is an open source refactoring and code analysis tool that actually handles a surprisingly large number of different languages and different file formats and works all those different tricks on them.
00:01:45
Speaker
And Jonathan's going to take us through how you really build a system for wrangling source code, including all the thorny edge cases like languages that are sometimes white space sensitive and developers that are manically white space sensitive. That comes up a lot too. So let's get started. I'm your host, Kris Jenkins. This is Developer Voices, and today's voice is Jonathan Schneider.
00:02:21
Speaker
My guest today is Jonathan Schneider. Jonathan, how are you? Doing super well. How are you? I'm very well, very well. I am looking forward to going back to algorithm school with you. Always looking forward to going there. Excellent.
00:02:37
Speaker
Before we get into the guts of how OpenRewrite works, I'd like to know how you find yourself in a world writing parsers galore, I assume, and abstract syntax tree manipulators, and all that stuff. Where did you start?

Challenges at Netflix: Freedom and Responsibility

00:02:53
Speaker
Yeah, this story began almost eight years ago. I was working on engineering tools at Netflix. And at the time, Netflix had a pretty small engineering team relative to the other FAANG companies in the area, and they had this special cultural tenet called freedom and responsibility, which meant as a member of a central team, you couldn't impose any constraints on what product engineers did.
00:03:20
Speaker
So I found myself on engineering tools trying to help people move forward, you know, for their own benefit as well. That could be moving forward from one language version to another, or we were trying to get off of an old logging library called Blitz4j that we had written internally and move to SLF4J. So these kinds of
00:03:45
Speaker
you know, migration activities that didn't affect just one product team, but the whole company. We were trying to get folks to move along, and spent quite a bit of time on reporting. So just the act of finding the issue, surfacing it to developers in a way that would let them know there was something to be done. And we went, honestly, pretty far in that. We
00:04:09
Speaker
like, you know, would report some sort of defect in the build log, made sure it was at the end of the build log so it's the first thing they saw, colorized the output of the build log, colorized Jenkins output, just kind of went all the way. And it resulted in approximately zero action on the part of developers. So we asked,
00:04:33
Speaker
"What would it take for you to do this?" And they'd kind of answer sarcastically, "Do it for me, otherwise I've got something else to do."
00:04:42
Speaker
They didn't need an external team coming to them with extra problems. I can sympathize with that. I can also sympathize with the position where you must have loads of different projects in loads of different languages, with loads of different versions and no consistency at all. That's a pretty good amount of ability culture. I guarantee that there is not a high degree of consistency between the code in one product team versus another.
00:05:11
Speaker
OK, so it would be tempting to change that culture and have a bit more enforcement of rules. I know that's not the road you went down; you took the pragmatic road.
00:05:23
Speaker
Yeah, and I would say there were certain attempts in that direction. There was definitely the like, maybe we should do a lot more unit testing here inside the company. And these kinds of things just tended to not latch on inside of it. So I think we started with something fairly small.

Gradle Lint Project: Successes and Challenges

00:05:42
Speaker
There was an open source project that still exists today called Gradle Lint, which was trying to just manipulate dependencies in Gradle Groovy files.
00:05:52
Speaker
Which seems like a pretty hard general problem, because Gradle as a build system is expressed in Groovy, so that's a Turing-complete language.
00:06:04
Speaker
In theory, depending on how funky those files get, that could be very nasty. Yeah, it could be. So that was awesome. Gradle lint was an AST manipulation sort of thing. But the scope of the problem was very small. So I think we started with that, just trying to manipulate or massage dependencies and other configuration in Gradle files. And then pretty quickly, people started saying, what if we could do this in the main source code as well, outside of just the build files?
00:06:34
Speaker
Yeah. So how successful first was the Gradle manipulator? I mean, how well did it work in the face of Turing complete configuration files? I think it's great to be really, really focused on how much you can achieve pragmatically and not theoretically here. So
00:06:57
Speaker
90% of the time people define a dependency in a very vanilla, plain way. Sometimes they iterate over a list of something and then map it to something else and somehow a dependency comes out of that. But if you just discard the exotic cases for a little bit and think, can I solve a large part of the problem, then we actually got pretty far with that.
00:07:22
Speaker
You tackled Java next, is that how it evolved? Java next. The Netflix microservice ecosystem at the time was predominantly Java, with the exception of the studio team, which I think was writing Ruby, but it was more or less Java across the board.
00:07:46
Speaker
This is already pretty ambitious, right? Because you must be writing a parser. Are you writing a parser for Groovy and now writing a parser for Java, or are you hooking into their parsers? Definitely hooking into theirs. So Groovy had some pretty great tools around that already, and some prior art to kind of work with. And Java, of course, the compiler itself is written in Java.
00:08:11
Speaker
The parser is quite good, as you can imagine, inside of the Java compiler. So rather than starting from the syntax, we started from an already richer representation, which was the compiler internal AST. OK, so take me through this. If I wanted to write a Java syntax tree rewriter, what would I do?

Creating and Transforming Syntax Trees

00:08:38
Speaker
In general, for any language, there are multiple steps. You've got the source code as text. And if you go back to your CS class on compilers way, way back when, you have to take that text and tokenize it and turn it into an abstract syntax tree. So there are technologies in the Java ecosystem and others, like ANTLR, which are fantastic for describing grammars and generating parsers off of those grammars.
00:09:09
Speaker
But that's going to get you a syntax tree. So that's going to turn just the text of the code into that tree of syntax nodes. But say you're looking at a method invocation, like I'm looking at a call to the method add on a list. I will have some syntax node, which is a method invocation node, where the simple name of the method is add.
00:09:36
Speaker
From just looking at the syntax, it's not apparent that the receiver type of that method call is list.
00:09:44
Speaker
Okay. So you just see the word add. I don't know if it's list.add. I don't know if that's set.add. I don't know if it's headers.add. So the syntax tree has already lost type information. Is that what you're saying? It's almost like it hasn't yet gotten type information. So you're starting from the source as text. Oh, right. We haven't run it through the
00:10:06
Speaker
The first step is to produce that syntax tree. Yeah, of course. OK, so a compiler second phase or the second step is to take that syntax tree and now go and start solving for types. So that's its responsibility is to go. I see this method add. I need to figure out which add it is.
00:10:25
Speaker
And so that can be a very complicated process, as you can imagine. It's looking up imports. It's looking at what's on the class path. It's trying to uniquely solve for what add is in this situation.
00:10:38
Speaker
Yeah. Yeah. And fun new stuff, fun, recent stuff like type inference along the way. Absolutely. Absolutely. Yeah. Yeah. You've got generics involved. So that can be a really complicated process. So when I started working on OpenRewrite, rather than starting from an ANTLR-like parser,
00:10:56
Speaker
I thought it's best to start from the richest parser available, which is a compiler. And so OpenRewrite's Java, we call this lossless semantic tree, is mapped from the internal compiler AST. Right, you've just introduced two new terms we need to break down there. So lossless semantic tree, what's that and why do I care?
00:11:23
Speaker
Yeah, absolutely. So the lossless semantic tree, or LST, is the syntax tree like we were describing, plus all that type attribution information, the solved types for where add is coming from. Plus, there's one other thing, which is we have to go back to the original source code and go find all the white space
00:11:48
Speaker
and comments and things of that sort, because that stuff is usually discarded very, very early on in the production of the abstract syntax tree. So all that kind of white space gets bolted on to this lossless semantic tree as well.
00:12:05
Speaker
Because you want to be able to spit out a new version of a file that has all the same white space and syntax, but a new method or a new variable name. And that's why we can't use the compiler internal AST directly, because they really have different purposes. The compiler internal AST is an intermediate representation
00:12:29
Speaker
on the way to producing bytecode. And of course the white space and things of that sort are not relevant to the bytecode. But they're highly relevant to actually printing back out as text of source code and producing a diff. So that's... I can see that. I would have... Do they not hold on to some of that for the sake of error messages?
00:12:51
Speaker
Yeah, generally, no. Most parsers will just throw away white space as they're tokenizing the source code. So as part of that tokenization, it may trip over a particular token and fail right there and know where it was. But yeah, it tends to be discarded very early on. It's even defined in the grammar as something to be just thrown away. Yeah.
00:13:20
Speaker
Yeah, I suppose maybe they're just holding on to line number and column number so that they can go back to the source code and show you that. So you're saying you've hooked into Java's parser and then it does the type resolution step.
00:13:37
Speaker
You hook in there, grab that, and then go back to the source code and chew in white space and stuff. So we wind up implementing a visitor over the internal compiler AST where we're going down all the compiler internal AST types and mapping them one at a time over to OpenRewrite LST model elements that correspond to each one of those. And while we're doing that, we have a position in the source code that we're advancing as we go,
00:14:07
Speaker
and so that we can always see what the prefix or white space on an element really is.
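A rough sketch of that idea: walk the elements the compiler reports, in source order, while advancing a position through the original text, and keep whatever was skipped as each element's prefix. `Element`, `attach_prefixes`, and `print_elements` are hypothetical names for this illustration, and the plain `str.find` lookup is a simplification (it would be fooled by a comment that happens to contain the next token).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Element:
    prefix: str  # whitespace and comments that preceded this element in the source
    text: str    # the element's own text, as the compiler's tree reports it

def attach_prefixes(source, element_texts):
    """Walk the elements in source order, advancing a cursor through the source."""
    cursor = 0
    elements = []
    for text in element_texts:
        found = source.find(text, cursor)
        if found == -1:
            # The tree and the source disagree, so the mapping has a bug.
            raise ValueError(f"expected {text!r} after offset {cursor}")
        elements.append(Element(prefix=source[cursor:found], text=text))
        cursor = found + len(text)
    return elements, source[cursor:]  # the remainder is trailing whitespace

def print_elements(elements, trailing):
    """Printing is just prefix + text for every element, in order."""
    return "".join(e.prefix + e.text for e in elements) + trailing

SOURCE = "list .add( /* arg */ item )  ;\n"
TEXTS = ["list", ".", "add", "(", "item", ")", ";"]

elements, trailing = attach_prefixes(SOURCE, TEXTS)
print(repr(elements[4].prefix))
print(print_elements(elements, trailing) == SOURCE)
```

Because every skipped character lands in some prefix, printing the elements back out reproduces the input exactly, odd spacing and comment included.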
00:14:14
Speaker
Is this like saying, OK, the next token I'm expecting must be an add. So give me all the white space until we find an add or explode. That's exactly right. Yeah. And so if the parser is done perfectly, you never have a mismatch of expectations. But that really is. Never say never. That's right. That's right. It takes a few iterations to get to that point. Yeah. OK. So how much?
00:14:41
Speaker
Have we got the sense of all the data you've added to this LST, lossless semantic tree? Almost, but then there's one other thing, and there's this concept we call markers, which marker is just kind of a bare interface that doesn't have any contract to it, but markers are just kind of like a bag of data.
00:15:03
Speaker
that we can hang on any level of the lossless semantic tree. In the very earliest forms, we only imagined markers as something that we would hang at the very top level. And so markers are things where we hold information like, what was the Java version used? What were all the types on the class path available at the time this was compiled, whether or not they're used inside the file?
00:15:27
Speaker
What were the transitive dependencies of this project at the time that it was compiled, and how were those transitive dependencies computed? What was influencing those version selections and so forth? All that information. You could do something like you could add some new code that uses ArrayList, and you'd know whether you need to add the import statement. That's right. And whether it will work. Okay, yeah.
00:15:52
Speaker
Or we can have conditional recipes where we'll say, I'm trying to get, you know, like a very verbose statement to a simpler one. And, you know, if Google Guava is on the class path, I'll do this. If Apache Commons is on the class path, I'll do that. So that you can, you know, produce a change that introduces as few additional new dependencies as possible.
00:16:18
Speaker
Yeah, yeah, that makes sense. And probably sticks with the conventions of the project at large. That's right. That's right. And that concept, you know, makes the change look idiomatically consistent in the context of each project that the change is being applied to. OK, that makes me wonder. I'll keep going in order, but like an abstract syntax tree, my mind is forking down different roads we can go. But let's stay on this for a bit.
00:16:48
Speaker
If you've got this lossless semantic tree, your next step, I assume, is to transform it into a different lossless semantic tree of what you want the code to be. That's right. That's absolutely right. Which is going to involve, you've got a tree of Java classes and you're rewriting that tree. That's right. Yeah. So every change is really just a change on the tree in some way or another.
00:17:16
Speaker
Right. Take me through an example. Yeah, absolutely. I give this example all the time of changing every integer literal to 42 because 42 is the answer to life, the universe, and everything. In that case, in this recipe, you would implement a visitor. A visitor you can think of as like an event-driven mechanism where you can intercept just literals.
00:17:46
Speaker
And you don't care what's surrounding that literal. Is that literal an assignment to a field in a class or is it?
00:17:57
Speaker
an assignment to a local variable, or is it part of a list or an array? I mean, where is it defined? You can intercept just the literal, and each LST element has data elements on it. So in the case of a literal, it's got a value, and it's got a value source on it. And so you would take that literal, and you would say with value, and you would change it to something else.
00:18:23
Speaker
In the case of OpenRewrite, we have this pattern called withers. The LST model elements themselves are all immutable value objects, and so when you call a with method on one of the data elements, you're actually constructing a new literal object and returning that new literal from the visit literal method.
00:18:48
Speaker
And that new object basically bubbles all the way back up to the top. So if you create a new literal, whatever contains it gets a new thing created, and a new thing created all the way up the line. We check at every level whether the thing that's been returned below me is the same reference
00:19:09
Speaker
as what I started with. And if it isn't, then a new reference gets created, so that by the time we bubble all the way back up to the top level compilation unit, the top level LST element, we can just do a simple referential equality check on that compilation unit to see whether a change has been made somewhere down the tree.
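The change-every-literal-to-42 recipe, the wither pattern, and the reference check at each level can be modelled in a few lines. This is a sketch in Python rather than OpenRewrite's Java API: the node classes and visitor names are invented for the example, and `dataclasses.replace` stands in for the `with...` methods described above.

```python
from dataclasses import dataclass, replace
from typing import Tuple

@dataclass(frozen=True)
class Literal:
    value: int
    value_source: str  # the text as written in the source

@dataclass(frozen=True)
class Assignment:
    name: str
    expr: object

@dataclass(frozen=True)
class Block:
    statements: Tuple[object, ...]

class Visitor:
    """Default visitor: walks the tree and returns every node unchanged."""

    def visit(self, node):
        return getattr(self, "visit_" + type(node).__name__.lower())(node)

    def visit_literal(self, literal):
        return literal

    def visit_assignment(self, assignment):
        expr = self.visit(assignment.expr)
        # Only build a new parent when the child is a different object.
        return assignment if expr is assignment.expr else replace(assignment, expr=expr)

    def visit_block(self, block):
        statements = tuple(self.visit(s) for s in block.statements)
        unchanged = all(new is old for new, old in zip(statements, block.statements))
        return block if unchanged else replace(block, statements=statements)

class EverythingIs42(Visitor):
    def visit_literal(self, literal):
        if literal.value == 42:
            return literal  # already the answer, so keep the same reference
        return replace(literal, value=42, value_source="42")

tree = Block((Assignment("a", Literal(1, "1")), Assignment("b", Literal(42, "42"))))
changed = EverythingIs42().visit(tree)
untouched = Visitor().visit(tree)

print(changed is tree)                              # False
print(changed.statements[1] is tree.statements[1])  # True
print(untouched is tree)                            # True
```

The recipe only overrides `visit_literal`; it never needs to know what surrounds the literal. The single `is` comparison at the root is enough to tell whether anything changed, and untouched subtrees are shared between the old and new trees.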
00:19:29
Speaker
Yeah, it's reminding me a lot of Git. When you change a file down in the tree, the hash bubbles up to the top. Yeah, we keep using the same tricks in computer science.
00:19:41
Speaker
OK, I'm going to have to push you on a more complicated example. That's for a literal. Let's say I want to do something that involves a subtree.

Rewriting Code and Flexibility Across Languages

00:19:52
Speaker
I found a faster way to add two strings together, so I want to replace all the string plus string with some function called on those two strings.
00:20:03
Speaker
Yeah, absolutely. Yeah, so we have two forms of visitors. So there'll be a Java visitor, which is designed to accept only Java LST elements, and so it kind of filters at the top. And then you have this set of visit methods, visit literal, visit binary, visit method invocation, all the way down.
00:20:23
Speaker
There's a special form of that visitor called the iso visitor. Iso roughly stands for isomorphic in this case. In the case of that literal thing we were talking about, I would implement a Java iso visitor because I'm getting a literal in and I'm always returning a literal. In this case, I'm just changing the value on it. In this scenario, I'm taking in a binary and I'm returning a method invocation.
00:20:51
Speaker
As long as it's just a regular Java visitor, I can always just completely construct a different type than the one that was given to me originally. Okay. And I have access to walk down my little sub-tree to find out what's going on and manipulate that. Yeah, so you would intercept the binary. You're looking for string concatenation binaries. And whenever you find one of a certain characteristic, you just return a method invocation instead.
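A minimal sketch of a non-isomorphic rewrite: a binary node goes in, a method invocation node comes out. The node classes, the `type_name` field (standing in for the resolved type that type attribution would supply), and the `fastConcat` function name are all hypothetical.

```python
from dataclasses import dataclass, replace
from typing import Tuple

@dataclass(frozen=True)
class Identifier:
    name: str
    type_name: str  # resolved type, as type attribution would supply it

@dataclass(frozen=True)
class Binary:
    left: object
    operator: str
    right: object
    type_name: str

@dataclass(frozen=True)
class MethodInvocation:
    name: str
    arguments: Tuple[object, ...]
    type_name: str

class ConcatToCall:
    """Replaces String-typed `a + b` with a call to a hypothetical fastConcat(a, b)."""

    def visit(self, node):
        if isinstance(node, Binary):
            return self.visit_binary(node)
        if isinstance(node, MethodInvocation):
            args = tuple(self.visit(a) for a in node.arguments)
            same = all(new is old for new, old in zip(args, node.arguments))
            return node if same else replace(node, arguments=args)
        return node

    def visit_binary(self, binary):
        left, right = self.visit(binary.left), self.visit(binary.right)
        if binary.operator == "+" and binary.type_name == "String":
            # A different node type comes back than the one that went in.
            return MethodInvocation("fastConcat", (left, right), "String")
        if left is binary.left and right is binary.right:
            return binary
        return replace(binary, left=left, right=right)

a, b = Identifier("a", "String"), Identifier("b", "String")
i, j = Identifier("i", "int"), Identifier("j", "int")

rewritten = ConcatToCall().visit(Binary(a, "+", b, "String"))
numeric = Binary(i, "+", j, "int")

print(rewritten)
print(ConcatToCall().visit(numeric) is numeric)
```

The syntax of `a + b` and `i + j` is identical; only the resolved type tells the visitor which one is a string concatenation, which is why the type information matters here.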
00:21:19
Speaker
Yeah, okay, that makes sense. What about context? Can I go up the tree and find out where I'm being called, what I'm being called within?
00:21:26
Speaker
Yeah, absolutely. The visitor has a concept called a cursor, which just maintains a stack of all the elements that have gotten you to this point. And cursors themselves, the cursors are actually created and discarded as you go up and down the tree. They've got a message passing mechanism on them as well. So as you're kind of like working your way down the tree, you can put messages
00:21:50
Speaker
on a cursor element. And then you might wind up eight levels further down, and you could say, look up and get me the nearest message, or put a message somewhere higher up. So that when control returns to that higher level object,
00:22:07
Speaker
I have data that's been supplied to me from somewhere below. That makes sense. Is it the case that I'm always writing these rewrite rules in Java code? I think you're going to see very shortly as we're just about to land the Ruby implementation that we think of two phases of language.
00:22:37
Speaker
adding new languages. One, that we're able to write a refactoring recipe in Java to transform that language because the core framework is written in Java. But then the second phase is that you're able to write a recipe in the target language that you're trying to manipulate.
00:22:56
Speaker
And how we're going to achieve that is very different depending on the language. In the case of Ruby, we've got JRuby, right? So it's very trivial actually to write the Ruby program, a Ruby recipe and execute it on the, you know, the sort of like general infrastructure for other languages. Until recently, this was a much harder proposition.
00:23:21
Speaker
Yeah, because I'm thinking, let me try and pick a good example, Terraform. I know you can rewrite Terraform.
00:23:29
Speaker
Is that something where you're hooking into the Terraform parser? And what prospect have you got of rewriting those in non-Java? Absolutely. In the case of Terraform, we did actually write an ANTLR grammar for Terraform and based it off of that. Now, notice in Terraform, there's no type attribution. There's no type solving. So for things like XML, JSON, those sort of things, we tend to start with those
00:23:55
Speaker
parser grammars, like ANTLR-style parser grammars, and build from there. Yeah, most of those sort of config file things, I'm guessing it isn't that hard to write a parser.
00:24:07
Speaker
It really depends. And in the case of Terraform, Terraform was actually much harder than it appears on the surface, because there's a lot of accidental, I think, grammatical ambiguity to that language that made it almost impossible to write an ANTLR grammar for it. I think we were able to do it. I think it goes to show, like,
00:24:32
Speaker
You know, friends don't let friends write new languages, right? Unless you're a really, really experienced language engineer, which I think Mitchell was not. You shouldn't be doing this. It's very difficult to write grammatically unambiguous languages.
00:24:51
Speaker
Yeah, I can believe that. I think particularly in the DevOps world, there's a tendency to start with a config file and keep adding features until you accidentally find it's Turing complete.

Power and Danger of Turing Complete Languages

00:25:05
Speaker
Absolutely. I still remember the proof of XSLT's Turing completeness way back when. Why does that not surprise me at all?
00:25:18
Speaker
It was a real shock at the time when I first saw that proof, that's for sure, but that really has influenced my thinking ever since. Yeah, yeah. It makes you wonder what Turing completeness actually means to us, right? I mean, is it a measure of power or a measure of danger sometimes? Well, from the language engineering perspective, I see it as danger. Yeah, it definitely could be a danger. You must have particular feelings about YAML then.
00:25:50
Speaker
There's an interesting point here with YAML. One of the surprising things as we developed new languages: we thought we were going to be working on completely different models for each language, different LST models. I remember when we started looking at JavaScript and building up Rewrite JavaScript,
00:26:10
Speaker
we found that there was like 90 plus percent similarity between the structure of the LST model of JavaScript and Java. And so in a way that's not surprising, they're both broadly C family languages, so there's a lot of
00:26:27
Speaker
you know, a common history to them, even though they look very different in source code. So when we went to implement JavaScript, the LST model actually extends from J, which is the kind of root Java model,
00:26:43
Speaker
which meant that there was some reuse actually between these languages as well. The change method name recipe written for Java actually works automatically in JavaScript. And so it's commonly thought that YAML was just JSON in another form.
00:27:03
Speaker
But there is actually nothing shared between the two. Even though conceptually they're very similar, they're quite different structurally. Okay. So, JavaScript, hang on, because I can believe that JavaScript and Java are roughly similar, especially as JavaScript was explicitly trying to ape Java in many ways, because of popularity at the time.
00:27:31
Speaker
But do you not gain... I mean, like, JavaScript has some really weird... It does. ... corner cases about coercion and the meaning of whitespace and that stuff. Absolutely. And so, you know, coercion, like, the way that the languages treat their types doesn't necessarily relate to the structure of their syntax. So the syntax tree, the syntax model, the model elements are very, very similar.
00:28:01
Speaker
We call these grammar islands of self-similar languages. On that same grammar island with Java and JavaScript are Groovy and Kotlin, not surprising, JVM languages, although Kotlin looks very different from Java.
00:28:14
Speaker
So maybe that is surprising. C# is on that same grammar island. Python is on the same grammar island. Believe it or not, Ruby is on that same grammar island. And Ruby is quite... The one that surprises me there is Python, because I would have thought being white space sensitive, with white space delimited blocks, would change things.

Ensuring Parse-to-Print Idempotence

00:28:36
Speaker
It only changes things in that there's less freedom to choose alternative white space.
00:28:43
Speaker
formatting, I think is really what it comes down to. Okay, then let's move on to the third stage of this, because once you've got a new abstract syntax tree, you need to print it out. That's a whole extra boatload of work.
00:29:01
Speaker
Yeah, the printer itself is actually a visitor. So it's a visitor that just accumulates information to some sort of string builder or appender as it's walking down the tree. So that's definitely part of any language implementation. We tend to develop the printer and the parser simultaneously. That makes sense.
00:29:28
Speaker
And so we'll build a whole suite of unit tests that represent all the syntactic variation that we can discover. And the success criteria is that we're able to take a piece of text that represents that example syntax, parse it into that LST, and then print it back out losslessly.
00:29:52
Speaker
We call that parse-to-print idempotence: if we start from source code, we should end with that same source code.
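That success criterion can be expressed as a single property over a corpus of samples: printing the parsed tree must give back the original text, byte for byte. In the sketch below, `parse` is a deliberately trivial stand-in (a flat list of prefix and token pairs, not a real parser), and the sample strings are invented. The lossy printer is included only to show what the check catches.

```python
import re

# Optional whitespace, then either a word or a single punctuation character.
TOKEN = re.compile(r"(?P<prefix>\s*)(?P<text>\w+|[^\w\s])")

def parse(source):
    """Toy lossless 'parser': (prefix, text) pairs plus trailing whitespace."""
    elements, pos = [], 0
    while True:
        match = TOKEN.match(source, pos)
        if match is None:
            break
        elements.append((match.group("prefix"), match.group("text")))
        pos = match.end()
    return elements, source[pos:]

def print_tree(tree):
    elements, trailing = tree
    return "".join(prefix + text for prefix, text in elements) + trailing

def lossy_print(tree):
    """What a printer looks like if the whitespace had been thrown away."""
    elements, _ = tree
    return " ".join(text for _, text in elements)

def failures(samples, printer=print_tree):
    """Every sample that does not survive a parse/print round trip unchanged."""
    return [s for s in samples if printer(parse(s)) != s]

SAMPLES = [
    "int x=1;",
    "int   x = 1 ;\n",
    "\tlist.add( item ); // trailing comment\n\n",
    "",
]

print(failures(SAMPLES))
print(len(failures(SAMPLES, printer=lossy_print)))
```

Running the same check over a very large body of real code, as described for the Netflix repositories, is what shakes out the syntax the unit tests did not anticipate.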
00:29:59
Speaker
Yeah, so your success criteria is, having done an enormous amount of work, you achieve nothing. That's right. It's a very interesting place to build. That's absolutely right. So when I was first writing the Java parser, the goal was to take the entire 20,000 Netflix microservices or whatever it was and prove parse-to-print idempotence on 100%. Right. How long did that take to get to that point?
00:30:30
Speaker
I would say it took about four months for that initial implementation. That seems a lot faster than I would have expected.
00:30:37
Speaker
And that speaks, I think, to the quality, in this case, of the internal compiler AST. It really is a work of art. I mean, I think depending on the language, the internal compiler ASTs are, well, there are various qualities, depending on the language, depending on the implementation, but the Java one is really an amazing piece of work and engineering.
00:31:06
Speaker
Okay, some people are going to think me mean for asking this, but I can't resist. Have you tackled PHP? And how's that? I haven't even looked at PHP, though. I'm curious, though. I hear Ruby's syntax is famously spicy. That must be a challenge enough. Very spicy. I'm pretty confident you could not define an ANTLR grammar, or any sort of formal grammar, for Ruby. I think it's, it's
00:31:35
Speaker
Although there's a lot of accidental complexity in Ruby. Can you give me an example?
00:31:41
Speaker
Yeah, I think there's this example of a ternary expression where you could say, assuming there's no spaces here, 1 equals equals x question mark. That's a ternary kind of conditional. And then you could say the true statement, so 1 equals equals x question mark a colon b. That will fail to compile. But if you write x equals equals 1 question mark
00:32:10
Speaker
a colon b, that will succeed at compiling. So the only difference here is 1 equals equals x versus x equals equals 1. I've just inverted that at the beginning. And that's because question mark can be part of an identifier. So Ruby can't tell whether that's part of a method name or if it's a ternary operator.
00:32:37
Speaker
So if the x is next to the question mark, it could be either. And yeah, 1 question mark isn't a valid identifier, so it's not ambiguous. So there are things like that. How could you teach a grammar or a parser to really recognize the distinction between those two things?
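The ambiguity can be illustrated with a toy lexer in which identifiers may end in a question mark, the same character that starts a ternary. This is not how Ruby's own parser works; the token rules below are invented purely to show why the two spellings come apart.

```python
import re

# Toy token rules, tried in order. An identifier may end in "?" (as in a
# predicate method name), which is also the character that starts a ternary.
TOKEN_RULES = [
    ("identifier", re.compile(r"[a-z_][a-z0-9_]*\??")),
    ("number", re.compile(r"[0-9]+")),
    ("operator", re.compile(r"==|\?|:")),
]

def lex(source):
    tokens, pos = [], 0
    while pos < len(source):
        for kind, pattern in TOKEN_RULES:
            match = pattern.match(source, pos)
            if match:
                tokens.append((kind, match.group()))
                pos = match.end()
                break
        else:
            raise ValueError(f"unexpected character {source[pos]!r} at {pos}")
    return tokens

def has_ternary_operator(source):
    return ("operator", "?") in lex(source)

print(lex("1==x?a:b"))
print(lex("x==1?a:b"))
```

In the first input the greedy identifier rule swallows the question mark as part of `x?`, so no ternary operator token is ever produced. In the second, `1?` cannot be an identifier, so the question mark is lexed as an operator.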
00:32:57
Speaker
Yeah, that doesn't seem like fun. But you must hit that kind of problem on the way out as well, because you've got to be careful not to create rewrite rules that can produce that kind of output. I think that's true. And I wouldn't say that the LST model prohibits you from creating a change that would ultimately result in a compilation error.
00:33:22
Speaker
But rather, and this goes back to the pragmatic thing, how often would it be that I would do a change method name that would result in this kind of scenario? It's not going to be super common. Yeah. Okay. So you're content in those cases to just say, well, git checkout and rewrite the files back that way. That's right. Yeah. That's fair. What about, and this is one of those topics that's
00:33:52
Speaker
kind of unnecessarily controversial, in my opinion, but it is controversial. You're writing out some new code. I mean, you can do identity transformations, but you're writing out some new code, and you want to keep it in the same white space style as the original code.
00:34:08
Speaker
Absolutely, yeah. And so one of the things we do with that, if we go back to how we were defining LST, how we keep all the white space of the original code, we have a whole series of heuristics where we derive what the prevailing style of that project is.
00:34:28
Speaker
you know, based on the white space that we can observe in it. So, is it using tabs? Is it using spaces? How is it doing continuation indenting versus indenting? So, do I do two indentations to the left? So, it's just like when we use like fluent method builders, or is it just one? That can be language specific. Continuation indenting in Kotlin is less
00:34:51
Speaker
common in 2024 than it is in Java. Some of these rules about what should be continuation indented versus not are themselves, I think, not really well specified and are subject to probably IDE bugs over many years where something that should have been continuation indented was only indented or vice versa. But then that rule sticks.
00:35:20
Speaker
Because to change it, to fix that bug, would cause auto formatting a file to change formatting. And so there's this tendency for these intentional or unintentional rules around indentation to stick over the long term.
00:35:36
Speaker
and discovering all that, it's very complicated. But at any rate, there's a utility inside of the visitor called auto format, where you can call auto format just on a subtree that you're modifying, or you could call auto format on the whole file. And so whenever you're inserting new blocks of code, you tend to call auto format on just that subtree that you're inserting, and it massages it to look consistent with the context of the code around it.
00:36:06
Speaker
Okay, so I don't have to worry about white space until right at the end of the piece of syntax I'm about to return. This problem actually goes beyond just white space to also include other forms of style like import formatting. Should I use wildcard imports like a star wildcard import or not? How many types need to be in a package for me to use a wildcard import versus not?
00:36:31
Speaker
is there a different number of types in static imports versus non-static imports to use a wildcard or not? That is also something that we're deriving as we're looking at the existing code. You may get to a situation where you just remove an import. You remove one type because you've changed it to something else. Is that type covered by a wildcard import now?
00:36:59
Speaker
And by removing the last reference to that type in the file, are there now only, you know, n minus one types covered by the wildcard import remaining? And therefore I should unfold the wildcard into a series of named type imports. And so that stuff is also kind of rolled up into what it means to add and remove imports at the framework level.
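The folding and unfolding decision described here can be sketched roughly as follows. This is a simplified illustration, not OpenRewrite's implementation: the `layout_imports` function and the `star_threshold` parameter are hypothetical names, and in the real tool the threshold is derived from the existing code rather than passed in.

```python
def layout_imports(used: dict[str, set[str]], star_threshold: int) -> list[str]:
    """Fold a package into `pkg.*` once it supplies at least star_threshold types."""
    lines = []
    for package in sorted(used):
        types = used[package]
        if len(types) >= star_threshold:
            lines.append(f"import {package}.*;")
        else:
            # Below the threshold: spell out each type by name.
            lines.extend(f"import {package}.{name};" for name in sorted(types))
    return lines

used = {"java.util": {"List", "Map", "Set"}}
before = layout_imports(used, star_threshold=3)

used["java.util"].discard("Set")  # a recipe removed the last use of Set
after = layout_imports(used, star_threshold=3)

print(before)
print(after)
```

Removing one type drops the package below the threshold, so the wildcard unfolds into named imports without the recipe author having to think about it.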
00:37:24
Speaker
And I can see that you get people mad at you if you don't do this. You get really, really hot about these things. That's right. That's right. How are you actually doing that? Are you just sort of heuristically tracking for patterns or something?
00:37:42
Speaker
Absolutely. It's an algorithm that you can implement as a visitor itself. You can look at all the body of code after you've parsed it into LSTs and calculate. We basically count occurrences of different kinds of styles and then try to decide on what the predominant style is.
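A minimal sketch of that counting approach, applied only to tabs versus spaces and indent width. The `detect_indent_style` function is a hypothetical helper working on raw text; the real heuristics run as a visitor over the LST and cover far more (continuation indents, imports and so on).

```python
from collections import Counter

def detect_indent_style(source: str) -> tuple[str, int]:
    """Return ("tabs", 1) or ("spaces", width) by counting what the file uses."""
    kinds = Counter()
    widths = Counter()
    previous = 0
    for line in source.splitlines():
        if not line.strip():
            continue  # blank lines carry no style signal
        indent = line[: len(line) - len(line.lstrip(" \t"))]
        if indent.startswith("\t"):
            kinds["tabs"] += 1
        elif indent:
            kinds["spaces"] += 1
        step = len(indent) - previous
        if indent and "\t" not in indent and step > 0:
            widths[step] += 1  # how far each deeper level steps in
        previous = len(indent)
    if kinds["tabs"] > kinds["spaces"]:
        return ("tabs", 1)
    # Fall back to 4 when the file gives no evidence (an arbitrary default).
    width = widths.most_common(1)[0][0] if widths else 4
    return ("spaces", width)

sample = "class A {\n    void f() {\n        g();\n    }\n}\n"
print(detect_indent_style(sample))
```

The detected style can then be fed to a formatter as its configuration, instead of asking the user to declare it.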
00:38:06
Speaker
So could you write a rule that said, go through my code base, find the most popular style of indentation that we tend to use, and make the entire code base use that universally?

Code Style Consistency and Enforcing Uniformity

00:38:17
Speaker
Absolutely. Yeah. So auto format itself is a recipe that has all of the individual options. So you could derive what the predominant style is, and then use that as input to run an auto format recipe to make the repository consistent in that way.
00:38:33
Speaker
And that's not really something people start with, actually, which is to just go through and add some level of consistency before they do anything else. Why can I see someone running this as soon as Jeff leaves the company going over it to rewrite all Jeff's code? Because Jeff was rubbish at that stuff. I know, yeah. We didn't have the heart to tell him until he left. That's right.
00:39:00
Speaker
I tend to, I think personally, I tend to avoid this kind of conflict by just blaming someone else. I'll just say, we use whatever the IDE default is. No question.
00:39:14
Speaker
That's the easiest way to onboard new employees, especially if you're working in an open source project, you don't have to have a contributors.markdown or something where you can use this style or that style. If I clone it with a popular IDE, it should just work. I don't always love the default styles, but I've long since learned to just accept it.
00:39:37
Speaker
Yeah, me too. I generally don't love the default styles, but I love not arguing about it at all. And that's worth more to me. I like languages these days that ship with an official code formatter that has no options and no flags. That certainly makes it a lot easier.
00:39:55
Speaker
But you don't get that luxury. You have to live in a world where that isn't the case, right? That's right. And, you know, there is... You know, Golang is one of those, right? Golang is one of those that has gofmt as part of its kind of core technology stack.
00:40:12
Speaker
I think that is representative of Google culture, which is much more control-oriented. Like, you know, monorepo, google-java-format or gofmt, like one style to rule them all. There are some good things about that. You can refactor more quickly.
00:40:33
Speaker
But it also has consequences. There's a higher degree of code review. In some ways, things don't move as fast, but there's a lower degree of variability in that code base as well.
00:40:53
Speaker
Yeah, I think you have to choose your battles, which metrics are worth having flexibility over, and which might you just delegate to a computer and not worry about anymore. That's very true. Yeah, that's very true. Okay, well, why don't we get in a little bit, as we're veering there anyway, get back into user space a bit and talk about, as a user of a rewrite tool, like OpenRewrite,
00:41:19
Speaker
What kind of things can it do for me? We've talked about reformatting. And how hard is it for me to extend it with my own in-house special rules? Yeah. And so there's, what can it do for me? I think this system of refactoring, as I'm describing it, this is a rule-based refactoring system. This is very much an
00:41:48
Speaker
encapsulation play or belief in encapsulation. That if I provide a base recipe like change method name, that it will be quicker to write common API changes on top of that base. And a recognition that I think that the
00:42:10
Speaker
The third party and open source ecosystem upon which we rely is itself subject to a lot of encapsulation. So I think about Spring Boot, for example, a really common Java framework. Spring Boot has its own Spring Boot code, but it's also built on top of Spring Framework, which is in turn built on top of all hundreds of thousands of other open source projects.
00:42:34
Speaker
And so, you know, if you take one of those, like the unit testing framework JUnit, you know, spring testing is based on JUnit. So if I solve the problem of moving from JUnit 4 to 5, if I could encapsulate that as a recipe, then it becomes easier for me to write a recipe that moves from, say, Spring Boot 2 to Spring Boot 3.
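The layering described here, where a framework migration reuses the migrations of the libraries beneath it, can be sketched like this. The `Recipe` class below is a hypothetical stand-in, not OpenRewrite's API, and plain text replacement substitutes for the type-aware changes a real recipe would make.

```python
from dataclasses import dataclass, field

@dataclass
class Recipe:
    name: str
    renames: dict[str, str] = field(default_factory=dict)   # old text -> new text
    includes: list["Recipe"] = field(default_factory=list)  # recipes this one builds on

    def run(self, source: str) -> str:
        for sub in self.includes:  # lower-level migrations run first
            source = sub.run(source)
        for old, new in self.renames.items():
            source = source.replace(old, new)
        return source

junit4_to_5 = Recipe("JUnit 4 to 5", {"org.junit.Test": "org.junit.jupiter.api.Test"})
boot2_to_3 = Recipe(
    "Spring Boot 2 to 3",
    {"javax.servlet": "jakarta.servlet"},
    includes=[junit4_to_5],  # reuse the JUnit migration rather than rewriting it
)

source = "import org.junit.Test;\nimport javax.servlet.Filter;\n"
migrated = boot2_to_3.run(source)
print(migrated)
```

Whoever writes the Spring Boot recipe never has to know the details of the JUnit change, which is the encapsulation argument being made.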
00:43:01
Speaker
It really requires a level of participation up and down the language stack from the lowest level library all the way to the big frameworks to unlock my ability as a user of one of those frameworks to move more quickly between versions.
00:43:21
Speaker
So you really are going to that level of ambition where you'd replace the whole version of a framework and all its subdependencies. Absolutely. I think my view of the world or the world I want to see is that framework authors are responsible when they make breaking changes for providing the recipe that fixes their downstream consumer. In other words, you break it, you fix it.
00:43:43
Speaker
Right now, the unit economics of this change are really backwards. I used to work for the spring team. If I make a breaking change, that impacts 20 million developers downstream of me. Or should I have taken the time at the time that I made the change to also provide the recipe? I think it's very, very, very important that such a technology be
00:44:07
Speaker
permissively open source? Because only if the technology is truly Apache licensed or something that's very permissive, that allows others to build commercial products on top of it, can we expect the community to really participate in writing recipes?
00:44:27
Speaker
Yeah, and then we might hit the dream where languages are a bit less afraid to deprecate old cruft from the earlier versions. That just as a whole world we're able to kind of move more quickly towards the outcome we want. We're not kind of dragged down by
00:44:44
Speaker
the technical debt of 10 years ago. But one of the blockers to that, I would have thought, will be the predominant language. Like, you can write transformations for different languages, and you support quite a lot in OpenRewrite, but you're kind of pushed to writing those rewrite rules in Java.
00:45:07
Speaker
And that's why it's very important, I think, that we do the two-phase language support, that you're able to write recipes in the language that you're trying to transform. And so the first example, you'll see how that is with Ruby coming out here, where you can write a Ruby recipe to transform Ruby. It's very important, I think, to provide a native recipe authorship experience in the language you're trying to transform.
00:45:31
Speaker
Do you think that will tend to be the job of the OpenRewrite maintainers? Or do you think it will eventually be something that language maintainers themselves support? I think if I create KrisLang, would I then be expected to write a KrisLang parser and printer and visitor?
00:45:56
Speaker
It's a good question. I think, you know, even below the language level at the framework level, I think what I want to see maybe, I believe a little bit in competition here, that I'm thinking of, like we've mentioned, Spring, but there's also competitors to Spring in the Java ecosystem. Red Hat has Quarkus, Oracle has Micronaut. Those are competing for market share with Spring, or developer mind share. Like, what should I build my application in?
00:46:23
Speaker
And so as a, you know, kind of call out, I think Oracle for the last couple of years has on every breaking change, Micronaut has provided a recipe that fixes that breaking change. So Micronaut was really leading the way on that experience of, you can move easily between major versions of Micronaut. That should be a reason why I consider Micronaut as a developer over Spring.
00:46:49
Speaker
And if I opt out of that as a framework author, then it's just one of the decision points that I'm making as a user.
00:47:02
Speaker
Yeah, so you're seeing it as a competitive advantage if you do choose to get involved. Absolutely. Now, imagine from a vendor's perspective, suppose I'm starting a new application monitoring platform and I'm trying to compete with Datadog. I know that I have to go through and replace Datadog specific
00:47:26
Speaker
like vendor lock-in source code with mine. If I provide those recipes, then I expand my customer base much more quickly than I would have otherwise.
00:47:38
Speaker
Yeah, and when one of your competitors goes bust, the first person to write those rules will get the lion's share of the people looking for a replacement, right? I mean, this is possibly pushing it too far, but how ambitious do you think you could get if someone said, I've got this Ruby on Rails site and I'd like to migrate it to Django on Python? Do you think you'd ever get that far or is that pushing the boat too far?
00:48:04
Speaker
My tendency is to think it's too far, but then again, I change over time in this too. Broadly, I say there's creative activity and software development, and then there's really mechanical activity. While it's hard to precisely define what's what, we feel it when it's mechanical.
00:48:22
Speaker
I used to, for a long time, I watched, you know, baseball in the United States in St. Louis, watched the St. Louis Cardinals. And they have 162 games a year. And I used to reserve maintenance activity for Cardinals baseball games, because baseball isn't quite interesting enough to consume my attention. But neither is mechanical activity in code. So I would do them at the same time, kind of. And, you know, but
00:48:50
Speaker
Do you consider Rails to Django migration to be creative or mechanical? Sometimes it can be more mechanical than we expect at the beginning. Yeah. I suppose if you're using the vanilla stuff rather than some particular funky... I bet libraries. I bet different libraries is where you fall down.
00:49:12
Speaker
It'll all just be a matter of diminishing return, right? Where is that point of diminishing return for me? Yes. At what point would you be better off just doing it creatively rather than trying to come up with a rule that can do it mechanically? That's right. Yeah, I can see that. Okay, so if we're getting into the land of multiple languages, does the game change when you start to try and apply this tool to whole projects or even whole company's worth of projects?
00:49:43
Speaker
Explain what you mean by

Scalability of Open Rewrite for Large Codebases

00:49:45
Speaker
that. I'm just thinking, so if you wanted to call Open Rewrite on Google or Facebook's Mono repo, now you're dealing with a vast number of languages and a vast code base. Can you scale to that kind of size? Are you keeping the size of the code base small and focused?
00:50:05
Speaker
I think, so scale is certainly one of the points where, as a company at Moderne, we've taken this LST, this lossless semantic tree, which in the open source tools we're producing in memory. And we've taken this extra step to figure out what it means to serialize it to disk.
00:50:31
Speaker
Which itself is a pretty tough problem, because these trees are actually cyclic in nature, especially the type attribution information. There's a lot of work to do in cutting cycles and reconstituting them as cycles on the other end. This could sort of work.
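One common way to cut cycles for serialization, sketched here as an illustration only: replace object references with indexes on the way out, then relink them in a second pass on the way in. The `TypeInfo` class and the JSON layout are assumptions for the example, not Moderne's on-disk format, and the sketch assumes every referenced type is included in the list being saved.

```python
import json

class TypeInfo:
    def __init__(self, name: str):
        self.name = name
        self.refers_to = []  # other TypeInfo objects; may point back to itself

def to_json(types: list) -> str:
    """Cut the cycles: store each reference as an index into the list."""
    ids = {id(t): i for i, t in enumerate(types)}
    flat = [{"name": t.name, "refs": [ids[id(r)] for r in t.refers_to]} for t in types]
    return json.dumps(flat)

def from_json(text: str) -> list:
    flat = json.loads(text)
    types = [TypeInfo(entry["name"]) for entry in flat]  # pass 1: create the nodes
    for t, entry in zip(types, flat):                     # pass 2: relink the cycles
        t.refers_to = [types[i] for i in entry["refs"]]
    return types

node = TypeInfo("Node")
tree = TypeInfo("Tree")
node.refers_to = [node, tree]  # Node refers to itself: a cycle
tree.refers_to = [node]

restored = from_json(to_json([node, tree]))
print(restored[0].refers_to[0] is restored[0])
```

The two passes matter: the objects must all exist before any of them can point at another.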
00:50:49
Speaker
But once you have an LST on disk, that's the sort of fundamental block upon which you can build a horizontally scalable system to operate on hundreds of millions of lines of code. And so I think when you're running an OpenRewrite recipe on one repo, the cost and time is dominated by the parsing, not the recipe running.
00:51:16
Speaker
And so you can still achieve a great degree of value, you parse, you run the recipe, you made the change, great, move on. But then when you want to do that for the other 10,000 repositories, you wind up sort of bearing that cost more and more. If the recipe is really, really well battle tested, think of like the Java 8 to 17 recipe at this point.
00:51:41
Speaker
It's still arguably okay to bear that cost, that parsing cost, once, no matter how many projects you have. And so when we heard Adam Selipsky, the AWS CEO, get up on stage and say that Amazon migrated 1,000 Java applications from Java 8 to 17 in its own code base in two days,
00:52:05
Speaker
He was doing that with open source, but they were able to do that because that recipe is so well battle tested. If you were developing a new recipe that was making some sort of change, you would want to be able to iterate on it many times over again. And in that case, the cost of parsing really, really, really adds up and starts to be an impediment. So what do you do? Do you attach your LST to the hash of the source file or something?
00:52:35
Speaker
That's definitely, absolutely. The goal would be clone dozens or hundreds of thousands of projects, build the LST, and keep that as an artifact.
00:52:46
Speaker
as both permanently in the artifact store and locally where you've got these repositories cloned. And then you can do a recursive operation on a whole directory structure, running a recipe on every repository that's found in that directory structure. Those kinds of experiences where you want to be able to operate on a large code base.
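A rough sketch of the idea of keeping the parsed tree as an artifact keyed by a hash of its source, so that iterating on a recipe does not pay the parsing cost again. The `TreeCache` class is hypothetical and keeps everything in memory; a real system would persist the trees to an artifact store.

```python
import hashlib

class TreeCache:
    """Keep parsed trees keyed by a hash of the source text they came from."""

    def __init__(self, parse):
        self._parse = parse   # the expensive parser function
        self._trees = {}      # sha256 hex digest -> parsed tree
        self.parse_count = 0

    def tree_for(self, source: str):
        key = hashlib.sha256(source.encode("utf-8")).hexdigest()
        if key not in self._trees:
            self.parse_count += 1  # only unseen source text is parsed
            self._trees[key] = self._parse(source)
        return self._trees[key]

# str.split stands in for a real, expensive parser in this sketch.
cache = TreeCache(parse=str.split)
for _ in range(3):  # e.g. three iterations while developing a recipe
    tree = cache.tree_for("class A { }")

print(cache.parse_count)
```

Any edit to the file changes the hash, so a stale tree is never served for modified source.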
00:53:10
Speaker
OK, what about something less invasive? Like, I'm assuming I can use the same system for code linting.
00:53:18
Speaker
Well, I don't want to change anything. I just want to be warned of patterns. Absolutely. Yeah. So this is a discovery, I think. Remember, we talked about markers a while back, and we were originally using markers to put information about what Java version, what dependencies existed, those kinds of things on the LST. But because a marker can exist anywhere,
00:53:44
Speaker
Moderne's co-founder Olga Kundzich a couple of years ago pioneered the idea of, what if we just used a special kind of marker to indicate where something was found as a search result?
00:53:58
Speaker
And so we do have an interface called search result, which is an extension of the marker. And a refactoring recipe, rather than making a change to the LST model, can just add a search result marker to any LST element. And then we have a choice of how to render that search result marker when we go to print out the LST. So we can print it as a little comment with an arrow pointing in a particular place or
00:54:21
Speaker
You can render it any way that you want to actualize it in the source code, but that unlocked the search capability. Curiously, search wound up being a special case of transformation.
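Search as a special case of transformation might look something like the sketch below: the "recipe" leaves the code untouched and only attaches a marker, and the printer decides how to show it. The `Node` class, the `find_calls` function, and the exact shape of the arrow comment are assumptions for illustration, not the tool's API.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    text: str
    markers: list[str] = field(default_factory=list)

def find_calls(nodes: list[Node], method: str) -> list[Node]:
    """A 'search recipe': tag matching nodes instead of changing their text."""
    for node in nodes:
        if node.text.startswith(method + "("):
            node.markers.append("found")
    return nodes

def print_tree(nodes: list[Node], show_markers: bool = True) -> str:
    out = []
    for node in nodes:
        # Render the marker as a small arrow comment in front of the element.
        prefix = "/*~~>*/" if show_markers and "found" in node.markers else ""
        out.append(prefix + node.text)
    return "\n".join(out)

nodes = find_calls([Node("foo();"), Node("bar();")], "foo")
print(print_tree(nodes))
```

Because the marker lives beside the text rather than in it, printing with `show_markers=False` gives back the original source unchanged.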
00:54:37
Speaker
Oh yeah, I can see that. Which is not what we expected at the beginning. And presumably quite a sophisticated search, because you've got all the type information there too. Exactly. So one of the really interesting examples is find uses of deprecated methods,
00:54:51
Speaker
where we're able to look at every method invocation in a body of source code and tell whether that invocation is deprecated at the time that it was compiled. There's nothing in the source code itself that would indicate to you that something's deprecated. It's something that's n levels up in the type attribution information.
00:55:11
Speaker
Yeah, yeah. Okay, that makes sense. So we start to take that one step further. So the search result marker was the first kind of foray into search. The second foray was that recipes can now define what's called data tables. And this is in open source as well. And data tables are just columnar data, like it's a columnar format, like,
00:55:36
Speaker
If you were just to emit a table, what would the column names be and what are the data elements in them? And then a recipe, as it's going through its normal visitor, working through the code base, can insert rows of that column structure into a table.
00:55:55
Speaker
And then at the end of the recipe, we can just actualize that table in an Excel file or a CSV or whatever columnar output that we want to achieve. And then if we do that across hundreds of repositories, we're actually collecting data across potentially the entire code base. So we use this to do things like
00:56:18
Speaker
find sensitive API endpoints that produce PII. Whether or not that PII is visible at the top level return or it's buried somewhere in the model object that you're returning. Find vulnerable dependencies, find usages of particular APIs. And one of the columns in the find methods data table is the actual code snippet of the method call.
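A data table in this sense can be sketched as a fixed set of column names plus rows that a recipe inserts while it visits the code. The `DataTable` class and the column names here are hypothetical, chosen to mirror the find-methods example, and the output uses Python's standard `csv` module.

```python
import csv
import io

class DataTable:
    """Columnar data a recipe fills in while it walks the code."""

    def __init__(self, columns: list[str]):
        self.columns = columns
        self.rows = []

    def insert_row(self, **values) -> None:
        # Store values in column order; a missing column raises KeyError.
        self.rows.append([values[c] for c in self.columns])

    def to_csv(self) -> str:
        buffer = io.StringIO()
        writer = csv.writer(buffer, lineterminator="\n")
        writer.writerow(self.columns)
        writer.writerows(self.rows)
        return buffer.getvalue()

table = DataTable(["repository", "file", "snippet"])
# In a real run, rows like this would come from every repository visited.
table.insert_row(repository="billing", file="A.java", snippet="client.send(msg)")
print(table.to_csv())
```

Run across many repositories, the same table accumulates one row per finding, which is what makes the impact-analysis use case work.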
00:56:46
Speaker
So if you're doing an impact analysis, I'm thinking of changing this API. You can start by just finding all the usages of the API and you quickly get an Excel table, all the like variations of the way that the API is used across the code. Yeah. Oh, there's been a security warning on some method in some dependent library. Do we use it? Do we use it anywhere important? Yeah, that's right. Yeah. That's quite cool. Okay. So.
00:57:13
Speaker
I should probably, for time, I should push on to my two last questions, which is, the first one is, let me think of an out of band language, Gleam.

Supporting New Languages in Open Rewrite

00:57:25
Speaker
I'm going to pick the language Gleam. And I decide I would love to have this capability for Gleam.
00:57:31
Speaker
Should I take inspiration from you and hook into the parser and write my own lossless tree and a printer, which sounds like a bunch of work, but not desperately, or should I just try and support Gleam in Open Rewrite, which you think will be the path of least resistance?
00:57:57
Speaker
I think that there's some common support in OpenRewrite that would help you make progress. This question, I think, partially depends on, you know, how hard the interop between the OpenRewrite sort of Java-based infrastructure and, you know, Gleam's compiler
00:58:27
Speaker
really is. Right. In this way, I was actually pretty concerned for a while. How are we going to support a language like C#? There's not really an obvious way to run C# code on Java. I can run Ruby code on Java easily. I can run Python code to some extent. Yes, because there's JRuby and Jython, yeah. But C#, what am I going to do? I think lately,
00:58:57
Speaker
There's been some really amazing work. There's a project out there called Chicory. It's dylibso, d-y-l-i-b-s-o, dylibso slash chicory on GitHub. And Chicory basically enables any Wasm compatible binary to run on the JVM.
00:59:19
Speaker
That's unlocked a world of opportunity, I think, for C#, for C++, for Swift, for those languages to build parsers. We would build a parser in Swift, and potentially even build a recipe in Swift, but then that can execute inside the context of the OpenRewrite recipe scheduler.
00:59:44
Speaker
So that, I think, is what we're going to see here in 2024 as we expand into C#. And that will provide the template that others will get to use as well. Yeah, there's a use case for Wasm I wouldn't have foreseen when they announced the project. Me too. I've been deriding it as CORBA 2 for years, and I might have to eat my words on this. Well, it's good to be humble occasionally, by the way.
01:00:15
Speaker
OK, so if I don't want to go to all that work, I am going to get myself a JVM and download OpenRewrite. Where should I start with it?
01:00:25
Speaker
Yeah, absolutely. So on GitHub, the organization is open rewrite, and there's a bunch of different repositories inside of that. Open rewrite slash rewrite is where most of the stable language parsers are. But we have modules like recipe modules, rewrite spring, rewrite logging frameworks, et cetera, that are community maintained for various applications.
01:00:52
Speaker
OK, cool. And so I'm not actually, I'm pretty sure I'm not going to go to this amount of work. I will just leave you as a final, unsubtle hint. It is possible to call Haskell code from Java. Is it really? Yeah, yeah. A company I worked for worked on a bridging project for it. I think you can go in both directions if memory serves. What's that project called? Oh.
01:01:20
Speaker
I will have to find it and send it to you and put it in the show notes. Okay. Very good. Yeah. In case you're tempted down the line to add Haskell support.

Getting Started with Open Rewrite

01:01:30
Speaker
Cool. Right. I think I should go and at least try and lint some code, if not rewrite it entirely. Jonathan, thank you very much for joining us. Yeah. Thank you. It was a pleasure.
01:01:42
Speaker
Thank you, Jonathan. And I have to correct that last thing I said there. I double checked. There are actually two libraries, one for calling Haskell from Java and one for the other way around. Link to both in the show notes if you want them, along with the more relevant and pressing link to open rewrite if you want to give that a try. It has got support for an impressive list of different file types and languages.
01:02:06
Speaker
And when you think about the amount of work that takes, I think it's worth checking if one of your languages or files is in that list. It might well have a place in your toolbox. Before you go and do that, if you've enjoyed this episode, please let us know. Give it a like if you've liked it. Share it with someone if you know someone who might enjoy it. And make sure you're subscribed because we'll be back next week with another interesting voice in the world of software development.
01:02:31
Speaker
Until then, I've been your host, Kris Jenkins. This has been Developer Voices with Jonathan Schneider. Thanks for listening.