• 2 Posts
  • 12 Comments
Joined 2 years ago
Cake day: July 13th, 2023

  • It re-consumes its own bullshit, and the bullshit it does print is the bullshit it also fed itself; it’s not lying about that. Of course, it is also always re-consuming the initial prompt too, so the end bullshit isn’t necessarily quite as far removed from the question as the length would indicate.
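    To be clear about the mechanism (a toy sketch, nothing model-specific): an autoregressive generator re-reads the prompt plus everything it has already emitted on every single step, so its own output is literally part of its next input. The dummy_model here is a made-up stand-in, just to make the loop runnable:

    ```python
    # Hypothetical toy: the point is the loop, not the "model".
    def dummy_model(context):
        # stand-in for next-token prediction; real models condition on
        # the entire context window in the same way
        return context[-1] + "-ish"

    def generate(prompt_tokens, steps):
        tokens = list(prompt_tokens)
        for _ in range(steps):
            # each step re-consumes prompt + all prior output
            tokens.append(dummy_model(tokens))
        return tokens

    print(generate(["river", "crossing"], 3))
    # ['river', 'crossing', 'crossing-ish', 'crossing-ish-ish', 'crossing-ish-ish-ish']
    ```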

    Where it gets deceptive is when it knows an answer to the problem, but constructs some bullshit for the purpose of making you believe that it solved the problem on its own. The only way to tell the difference is to ask it something simpler that it doesn’t know the answer to, and watch it bullshit in circles or bullshit its way to an incorrect answer.




  • I think they worked specifically on cheating the benchmarks, though, as well as on popular puzzles like pre-existing variants of the river crossing. It is a very large, very popular puzzle category; if the river crossing puzzle is not on the list, I don’t know what would be.

    Keep in mind that they are true believers, too - they think that if they cram enough little pieces of logical reasoning, taken from puzzles, into the AI, then they will get a robot god that will actually start coming up with new shit.

    I very much doubt that there’s some general reasoning performance improvement that results in these older puzzle variants getting solved while new ones, which aren’t particularly more difficult, still fail.


  • Did you use any of that kind of notation in the prompt? Or did some poor squadron of task workers write out a few thousand examples of this notation for river crossing problems in an attempt to give it an internal structure?

    I didn’t use any notation in the prompt, but gemini 2.5 pro seems to always represent the state of the problem after every step in some way. When asked if it does anything with it, it says it is “very important”, so it may be that there’s some huge invisible prompt that says it’s very important to do this.

    It also mentioned N cannibals and M missionaries.

    My theory is that they wrote a bunch of little scripts that generate puzzles and solutions in that format. Since the river crossing is one of the most popular puzzles, it would be on the list (and N cannibals / M missionaries is easy to generate variants of), although their main focus would have been the puzzles in the benchmarks that they are trying to cheat.
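    For illustration, here’s roughly the kind of script I’m imagining (my own hypothetical sketch, obviously not their actual pipeline): a few lines of BFS can mass-produce solved M-missionaries / N-cannibals instances to train on.

    ```python
    # Hypothetical generator: enumerate small variants and solve each
    # by BFS over states (missionaries_left, cannibals_left, boat_on_left).
    from collections import deque

    def solve(m, c, boat=2):
        def safe(mm, cc):
            # missionaries must not be outnumbered on either bank (unless absent)
            return (mm == 0 or mm >= cc) and (m - mm == 0 or m - mm >= c - cc)

        start, goal = (m, c, True), (0, 0, False)
        queue, seen = deque([(start, [])]), {start}
        while queue:
            (mm, cc, left), path = queue.popleft()
            if (mm, cc, left) == goal:
                return path
            sign = -1 if left else 1
            for dm in range(boat + 1):
                for dc in range(boat + 1 - dm):
                    if dm + dc == 0:
                        continue  # the boat needs at least one rower
                    nm, nc = mm + sign * dm, cc + sign * dc
                    state = (nm, nc, not left)
                    if 0 <= nm <= m and 0 <= nc <= c and safe(nm, nc) and state not in seen:
                        seen.add(state)
                        queue.append((state, path + [(dm, dc)]))
        return None  # unsolvable variant

    # dump (puzzle, solution) pairs for every small variant
    for m in range(1, 4):
        for c in range(1, 4):
            print(f"{m} missionaries, {c} cannibals:", solve(m, c))
    ```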

    edit: here’s one of the logs:

    https://pastebin.com/GKy8BTYD

    Basically it keeps on trying to brute force the problem. It gets the first 2 moves correct, but in a stopped-clock manner - if there are 2 people and 1 boat, they both take the boat; if there are 2 people and >=2 boats, then each of them takes a boat.

    It keeps doing the same shit until eventually its state tracking fails, or its reading of the state fails, and then it outputs the failure as a solution. Sometimes it deems the problem impossible:

    https://pastebin.com/Li9quqqd

    All tests were done with gemini 2.5 pro. I can post links if you need them, but links don’t include the “thinking” log, and I also suspect that if >N people come through a link they just look at it. Nobody really shares botshit unless it’s funny or stupid. A lot of people independently asking the same problem would often just mean there’s a new homework question, so they can’t use that as a signal so easily.


  • Yeah, I think the best examples are everyday problems that people solve all the time but don’t explicitly write out solutions for step by step, or at least not in the puzzle-answer form.

    It’s not even a novel problem at all; I’m sure there are plenty of descriptions of solutions to it as parts of stories and such. Just not framed as “logical puzzles”, due to triviality.

    What really annoys me is when they claim high performance on benchmarks consisting of fairly difficult problems. This is basically fraud, since they know full well it is still entirely “knowledge”-reliant, and they even take steps to augment it with generated problems and solutions.

    I guess the big sell is that it could use bits and pieces of logic gleaned from other solutions to solve a “new” problem. Except it can not.





  • I just describe it as “computer scientology, nowhere near as successful as the original”.

    The other thing is that he’s a Thiel project, different from but not any more sane than Curtis Yarvin aka Moldbug. So if they have heard of Moldbug’s political theories (which increasingly many people have, because of, well, those theories being enacted), it’s easy to give a general picture of total fucking insanity funded by Thiel money. It doesn’t really matter what the particular insanity is, and it matters even less now that the AGI shit has hit the mainstream, entirely bypassing anything Yudkowsky had to say on the subject.


  • Not really. Here’s the chain-of-word-vomit that led to the answers:

    https://pastebin.com/HQUExXkX

    Note that in its “it’s impossible” answer it correctly echoes that you can take one other item with you, and does not bring the duck back (while the old overfitted gpt4 obsessively brought items back), whereas in the duck + 3 vegetables variant it has a correct answer in the word vomit but, not being an AI enthusiast, it can’t actually choose that correct answer (a problem shared with the monkeys on typewriters).

    I’d say it clearly isn’t ignoring the prompt or the differences from the original river crossings. It just can’t actually reason, and the problem requires a modicum of reasoning, much as unloading groceries from a car does.


  • Yeah, exactly. There’s no trick to it at all, unlike the original puzzle.

    I also tested OpenAI’s offerings a few months back with similarly nonsensical results: https://awful.systems/post/1769506

    The all-vegetables, no-duck variant is solved correctly now, but I doubt that is due to improved reasoning as such; I think they may have augmented the training data with some variants of the river crossing. The river crossing is one of the best-known puzzles, and various people have been posting hilarious bot failures with variants of it, so it wouldn’t be unexpected for their training data augmentation to include river crossing variants.

    Of course, there are very many ways in which the puzzle can be modified, and their augmentation would only cover the obvious stuff, like variations on which items can be left with which items, or on the number of spots in the boat.
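    For illustration (my own hypothetical sketch, not their code): the whole wolf/goat/cabbage family collapses to just a conflict set plus a boat capacity, which is exactly why the obvious variants are so cheap to generate and the non-obvious ones stay uncovered.

    ```python
    # Hypothetical parameterization: items, pairs that can't be left
    # unattended together, and passenger seats on the boat.
    from collections import deque
    from itertools import combinations

    def solve(items, conflicts, capacity=1):
        all_items = frozenset(items)

        def safe(bank):
            # a bank without the farmer is safe iff no conflicting pair is on it
            return not any(pair <= bank for pair in conflicts)

        start = (all_items, True)  # (items on left bank, farmer on left?)
        queue, seen = deque([(start, [])]), {start}
        while queue:
            (left, farmer_left), path = queue.popleft()
            if not left and not farmer_left:
                return path
            here = left if farmer_left else all_items - left
            for k in range(capacity + 1):
                for cargo in map(frozenset, combinations(sorted(here), k)):
                    new_left = left - cargo if farmer_left else left | cargo
                    # the bank the farmer just left must stay safe
                    unattended = new_left if farmer_left else all_items - new_left
                    state = (new_left, not farmer_left)
                    if safe(unattended) and state not in seen:
                        seen.add(state)
                        queue.append((state, path + [tuple(sorted(cargo))]))
        return None

    # the classic variant, and a two-seat variant of the same puzzle
    conflicts = {frozenset({"wolf", "goat"}), frozenset({"goat", "cabbage"})}
    print(solve(["wolf", "goat", "cabbage"], conflicts))
    print(solve(["wolf", "goat", "cabbage"], conflicts, capacity=2))
    ```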



  • The other thing to add to this is that there are just one or two people on the train providing service for hundreds of other people, or for millions of dollars’ worth of goods. Automating those people away is simply not economical, not even in terms of the headcount replaced versus the headcount that has to be hired to maintain the automation software and hardware.

    Unless you’re a techbro who deeply resents labor, someone who would rather hire 10 software engineers than 1 train driver.