Hacker News new | past | comments | ask | show | jobs | submit login
You can use C-Reduce for any language (bernsteinbear.com)
300 points by Tomte 11 hours ago | hide | past | favorite | 62 comments





Well, aren't you going to share the reduced file?

Okay fine, let's see for ourselves:

    # Setup
    git clone [email protected]:RustPython/RustPython.git && cd RustPython && cargo build --release
    git clone [email protected]:tekknolagi/scrapscript.git && cp scrapscript/scrapscript.py .

    # Download interesting.sh, replace the path to RustPython
    chmod +x interesting.sh

    # Command in article:
    nix run nixpkgs#creduce -- --not-c interesting.sh scrapscript.py

Niiice:

    (93.2 %, 13717 bytes)
    (93.2 %, 13683 bytes)
    (93.3 %, 13614 bytes)
    (93.3 %, 13592 bytes)
    (93.3 %, 13571 bytes)
    (93.3 %, 13517 bytes)
    (93.3 %, 13449 bytes)
    (93.4 %, 13412 bytes)
    (93.4 %, 13365 bytes)
    (93.4 %, 13333 bytes)
    (93.4 %, 13313 bytes)

It seems to have stopped at "(96.4 %, 7347 bytes)" with the following output: https://gist.github.com/judofyr/47cba8a20cb2cd5798943ef975d0...

Random thought. Another commenter worried about the runtime of the program becoming mangled and performing destructive operations on your machine. What if you run the reducer as a source-to-source Nix derivation? Protects against dangerous things, and can be easily distributed to remote builders.

Note that as recommended by John Regehr, author of C-Reduce, for this use case you might also want to try Shrinkray, a tool that was written to be format independent and works well for cases that C-Reduce dow not: https://mastodon.social/@regehr/113489759789563570

Is this[0] the official repo of Shrinkray?

[0]: https://github.com/DRMacIver/shrinkray


Interesting, isn't that also the author of Python's hypothesis?

Yes, hypothesis also does minimization of failed test cases, so it's kind of a similar problem, just a question if what format the data is in and how you invoke your test.

Yep!

Looking forward to try it! A shout out to cvise (https://github.com/marxin/cvise) as a python alternative that works well too for non c languages.

A paper by the authors John Regehr et al. from 2012 explaining how it works https://fsl.cs.illinois.edu/publications/regehr-chen-cuoq-ei....

I read this paper and I still feel lost as to how can this even be possible. It seems to understand how to tokenize, merge lines, remove tokens etc for arbitrary programming languages. Is there another paper that explains this algorithm alone?

I was wondering the same thing but I guess the key is probably that you don't actually have to do it correctly every time. Just tokenizing based on a few common characteristics (brace pairing, quotes, indentation, newlines, etc.) should let you trim a ton without knowing anything about the language, I imagine?

My real worry is what if this ends up running dangerous code. Like what if you have a disabled line that writes instead of reading, that randomly gets reactivated?


It basically runs transformation passes over the code until they stop changing the code or break its behaviour.

It seems like a small number of the passes are not specific to C:

https://github.com/csmith-project/creduce/blob/master/creduc...

`"C" => 1,` means it is only for C.

I would guess `pass_lines` is the most important for non-C code; I'm guessing (it's written in unreadable Perl) that it removes lines.

So while it can work for languages other than C, most of its features are C-specific so it's not going to work nearly as well. Still I'd never heard of C-Reduce; pretty cool tool!

Someone make one based on Tree Sitter, quick!



This paper is not about C-Reduce as a whole but 3 new "domain-specific test-case reducers" that the authors added to the project.

Most of the non-domain specific reductions in C-Reduce is simply brute force IIRC.


Delta debugging is not new: https://en.wikipedia.org/wiki/Delta_debugging

My implementation of delta debugging, "delta" is 19+ years old: https://github.com/dsw/delta

I released it as Open Source because Microsoft Research sent someone to my office to ask me to, back when Microsoft was calling Open Source a "cancer".

The LLVM introduction by Latner refers to the "standard delta debugging tool", so it is rather well-known: https://aosabook.org/en/v1/llvm.html 'unlike the standard "delta" command line tool.'


For other readers' benefit: C-Reduce is a little more sophisticated than plain delta-debugging. From the abstract of Test-Case Reduction for C Compiler Bugs (2012):

> [...] [C-Reduce] produces outputs that are, on average, more than 25 times smaller than those produced by our other reducers or by the existing reducer that is most commonly used by compiler developers. We conclude that effective program reduction requires more than straightforward delta debugging.

(Of course, this means that C-Reduce is 12 years old now.)

At the same time, C-Reduce seems to be more general than the LLVM tool you linked ("BugPoint", dating to 2002), since that one works with LLVM IR specifically.

I think most developers are generally unfamiliar with automatic test case minimization tools and techniques, so the post may be helpful even if the ideas have been known in their respective circles for quite some time.


I did not know about C-Reduce until just now, and I'm already hooked. This is basically like discovering git bisect for the first time.

I'll have to stuff that into the back of my mind until I run into something where it might fit.


> I did not know about C-Reduce until just now, and I'm already hooked. This is basically like discovering git bisect for the first time.

I've known about git bisect for maybe a decade, and used it maybe... twice so far? And I think at least one of those times it took me longer to understand the command line options than it took to just do it manually.

YMMV, maybe it's more useful in a larger collaborative codebase, but to me it sounds life-changing in theory but I haven't found it that in practice.


At a previous job, with a huge C++ codebase and ~100 developers over 20 years, many platforms supported, a build could take hours, and the test suite took so long to run that it would take several days or weeks of real time to get through it all.

This cycle time combined with occasional unexpected interactions between components meant that in every release cycle, there were dozens of complicated failing tests where it was not obvious which code change was responsible.

`bisect` here was extremely helpful: instead of having to pore over commit history and think very hard, I could bisect with a small wrapper script that would build and run the failing test in question. Builds still took hours, but I could usually autonatically pinpoint the responsible code change for one of the tests overnight.

(This was not using Git, but Perforce, for which I had to write `p4-bisect`. Alas, it's not open-source...)


It's pretty incredible if I'm trying to find a breakage in a repo I don't understand. Build's failing? Easy enough:

    git bisect start master v1.1         # Bisect from v1.1 to current master
    git bisect run cmake --build .build  # Find the first commit where `cmake --build` returns non-zero.
No need to bother trying to understand the error message, I can go make coffee instead and then just look at the commit that broke it. Also, this is really useful for and in conjunction with Nixpkgs.

It's not as useful for personal projects because chances are it won't tell you anything you don't already know.


Doing this manually was part of my first job out of college on a C/C++ compiler team. Kind of amazing that there is automation that can accomplish the same thing!

I ran into some issues with what might be compiler bugs in cc65, a C compiler for the 6502 processor (c64, NES, Apple 1).

I might see if I can get this set up. Apparently VICE supports "printing" to a file on the host OS, so I could run my test on the emulator.


It's great when combined with random test input generators.

Yes, most property based testing libraries do this.

Hypothesis in particular has a very clever way to do test reduction. The key problem is that if your test generator is enforcing some constraint on the input, it's not necessarily the case that naive pruning will preserve the constraint. Hypothesis has a fairly general way of enforcing that the generated test input continues to satisfy constraints.

Same story, I can't wait to try this.

...so I just discovered git bisect for the first time and this is pretty cool.

Yes! The actual commands that you have to search up to get it to run automatically without user input, take almost as long to put together as finding the bug that you're finding (in the 80% case, of course there are 20% cases that take forever) but there is something so satisfying about seeing the thing just humming along once you have it set, that just makes it so satisfying.

> Yes! The actual commands that you have to search up to get it to run automatically without user input, take almost as long to put together as finding the bug that you're finding

Funny, I just wrote the exact same thing only to scroll down and see you had the same opinion:

https:///item?id=42262579


Found this article with examples of before and after: https://pramodkumbhar.com/2024/01/c-reduce-systematically-ta...

But still having trouble understanding how it knows WHAT to remove when trying each iteration. There must be some tokenization going on, but then I don't know how that would work across programming languages


Creduce is awesome.

I used a test script that spent hours using CSmith to generate random test programs when developing an esoteric llvm target. When they crashed it automatically Creduced and left them for inspection. Invaluable!


Works well for SQL too, I‘ve been using it at my day job, found out via https://github.com/sqlancer/sqlancer?tab=readme-ov-file#redu...

Without an explanation of why it works on languages other than C, that is a hard claim to believe! I mean, I don't think they're lying but (given it doesn't use an LLM) I am bewildered.

So the short answer is that some of the reduction passes are pretty generalizable to C-family languages, and these can be some of the most effective passes.

Some of those passes are:

* Tokenize the input according to C, and randomly chuck away chunks of length 1-(I think) 13. Most languages vaguely similar to C have very similar tokenization rules to C, so these kinds of passes are going to have similar effects, and this will tend to be effective in doing things like omitting needless qualifiers or attributes.

* Balanced parentheses, and chuck away units of balanced parentheses. This includes (), {}, and []--and for almost any language, this can be a useful level of stuff to throw away or reduce.

* Stripping comments and whitespace. Again, many languages use the /* and // styles of comments that C does, so this is pretty effective.

There's actually relatively few passes that are incredibly C/C++-specific, and in my experience, one of the biggest issues with creduce is that it is absolutely terrible at trying to detemplatize the code as a reduction step, which should be a relatively easily automatible step.


I recommend skimming the PLDI paper linked by asmeurer, it has a good summary. Some of its transforms are pretty C-specific, using the Clang frontend; some are pretty generic and probably work on any Algol-descended language. It's meant to be a modular tool, so you could add transforms that understand other languages if you like.

Without looking at the paper, I would guess it would be like a fuzzer that apply mutations that reduce the size of the input.

Isn't that potentially unsafe though? Random permutations can reduce arbitrary programs to destructive programs, in particular any program can be mutated to `rm -rf /` given enough time. Also even the problem of finding syntactically valid programs is combinatoric, it's pretty surprising that it can go toward a local optima in such a short time without having any understanding of the function or its derivatives.

For compiled languages it should be fine, as you're only going to compile the permuted source code, not execute it.

Given a sufficiently bad compiler bug anything is possible, but I think given that it's trying to minimize the size of an input that gives a particular output I don't think it's likely to explore distant branches.


> For compiled languages it should be fine, as you're only going to compile the permuted source code, not execute it.

Only if you'te going for compiler bug. If you're working on minimal reproducible example. You need to make sure your code isn't reduced to:

int main() { return 0; }

in that case.


It runs the executed code in order to determine if the bug exists, does it not?

That depends completely on the interesting-ness test that's provided. If you're looking for a compiler bug (like I do often for my language), then the interesting-ness test checks the compile logs for information like the "Segmentation Fault" text, no need to run the actual executable. You could also hoist everything into docker if you really want to, or ship it off to another device to run.

> Given a sufficiently bad compiler bug anything is possible, ...

I can't help but wonder about the consteval/constexpr machinery in the various C++ compilers... It'd be interesting to see how much adversarial input has been tested for those.

(I guess Zig's comptime might also be relevant, but I'm not sure what that's allowed to do. Maybe execute arbitrary code?)

... anyway just a stray thought.


It seems pretty unlikely to mutate in a malicious direction, random chance probably wouldn't get there, and there doesn't seem like any reason it would be guided in that direction.

"Enough time" does a lot of work here and warrants further analysis. With enough time it might produce the works of shakespare (if you ignore its designed to minimize), but we should probably view anything that would take more than a year as too much times.


Use a sandboxed VM?

Yes, and without understanding how it works, I'm left wondering whether it's even safe to use this way. Will creduce execute mutated versions of the input script, potentially deleting my files or eating my lunch?

it will execute whatever script you pass it (immutable) and mutate whatever other files you pass it

I think the question is, how do you know C-reduce isn't going to mutate your test case into calling `remove("/path/to/my/important/file");`

C-reduce is meant to... Reduce your files, it would not add stuff that was not there in the first place. Also, I think it's meant to only be run against the "frontend" of most languages, not full execution

Yeah but you could have a shell script that does

    rm -rf /foo/bar
and in theory it could reduce it to

    rm -rf /
The passes are very heuristic, so that doesn't seem like something you can rule out.

(and I actually want to run it on shell scripts! )


While I've used c-reduce before, I've never done it in a way where it could be destructive. However speculating based on what I do know. I think the 2 things I would do would be in the interesting-ness test, grep for already known harmful instructions and force them to be uninteresting (returning 1). And then if I was still unsure of the potential harm of the program and I had to run it to determine if it's interesting or not (segfault or something similar). I think I would hoist the binary/script into a docker container and run it there. That way the log and result can still be checked, but it's access to actual file system is minimized as much as possible.

TLDR; C-Reduce just gives you text to run/compile, if you're worried about things like that sandbox as much as possible.


So just run it inside of a Docker instance.

> I mean, I don't think they're lying but (given it doesn't use an LLM) I am bewildered.

I’m pretty sure (based on the last time I saw this) that this is just good old fashioned computer science.[1]

I really hope that HN hasn’t forgotten about good old fashioned computer science stuff already.

[1] All the algorithm and whatnot stuff, not the spooky machine learning stuff. Even things like Prolog-for-AI, although that has the slight downside of not working (for the purposes of making AI).


To be clear my comment was meant to be an awareness that it is good old fashioned computer science. Without LLMs, which this predates, it is surprising to me that you'd have a lot of success reducing a program in an arbitrary language and still having something that's valid syntax!

I do get that it will reject a lot of stuff as not working (and has to even in the target language) and I also get that algol-like languages are all very similar, but I am still surprised that it works well enough to function on ~arbitrary languages.

These are very LLM-era properties for a program to have. The question is not "does it work for language x" but "how well does it work for language x", and the answer is not "is it one of the languages it was designed for" but instead "idunno try it out and see".


Yeah, I would have liked to see more examples and an explanation.

I glanced at the code. It seems to have multiple possible "passes" which reduce the code in various ways, and the passes here not tagged with "C"=>1 are used in the not-c mode recommended in the post.

https://github.com/csmith-project/creduce/blob/31e855e290970...

You can see the pass implementation in files on the left.


The key thing is that the transforms it makes aren’t required to always produce a program that can actually run, or even build.

The user provides a script which determines if a program is “interesting.” A program with build-time errors should be considered “not interesting” in most cases. (And if you’re hunting an incorrectly reported error, you’ll want to make sure you only consider that error to be interesting.)

Then it yoinks out parts of your program and checks if the result is still interesting. If it’s not, that attempt is discarded and it tries something else. If the reduced program is still interesting, it will try yoinking out more stuff. Repeat until you like the result.

There doesn’t need to be any understanding of the program in order for this to work. You just need something where removing some bit of the program has a chance of producing a new program that’s still interesting. That works in most programming languages.

This process can take a long time, and it’s slower when there’s a lower probability of a removal producing an interesting program. Thus heuristics are added: don’t remove random character ranges, but work at a token level. When removing a brace, find the matching brace and remove the insides. That sort of thing. Most modern languages are similar enough to C that many of these rules are helpful for them too. And even if they don’t help, this just slows the process down, but it still works.


True, I've used it for asm, though these days I prefer c-vise.


I wrote a general purpose delta debugger in python - https://github.com/andrewchambers/ddmin-python. Can be used to do the same thing.

A good companion/ successor is cvise.

I've used c-reduce/cvise to reduce assembly programs to the minimal set that evokes an assembler defect quite a few times. It's a real handy program.

[1] https://github.com/marxin/cvise


the seems somewhat interesting but the example given is incredibly stupid.

Thank you for your feedback. I'll take that into consideration for my next blog post and compiler.



Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: