
Well, to be perfectly honest, it's a hard question for an LLM that reasons in tokens rather than letters. It reminds me of that classic test that kids easily pass and grownups utterly fail. The test looks like this: continue the sequence:

  0 - 1
  5 - 0
  6 - 1
  7 - 0
  8 - 2
  9 - ?
Grownups try to find a pattern in the numbers: different types of series, progressions, etc. The correct answer is 1, because that's the number of circles in the graphical image of the digit "9".
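
For the curious, the rule is small enough to write down directly. A minimal sketch in Python, assuming the standard printed glyphs (a closed-top "4" would change its entry):

  # Map each digit to the number of enclosed loops in its printed glyph.
  loops = {'0': 1, '1': 0, '2': 0, '3': 0, '4': 0,
           '5': 0, '6': 1, '7': 0, '8': 2, '9': 1}
  for d in "056789":
      print(d, '-', loops[d])  # reproduces the puzzle; 9 -> 1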





These tests always make me wonder: What qualifies as a valid pattern rule?

For example, why wouldn't "0" be a correct answer here (rule being "every other number on the right should be 0, other numbers do not have a pattern")?


I think it's better phrased as "find the best rule", with a tacit understanding that people mostly agree on what makes a rule decent vs. terrible (if not on what makes one great), and a tacit promise that the sequence presented has at least one decent rule and not several competing ones.

A rule being "good" is largely about simplicity, which is also essentially the trick that deep learning uses to escape no-free-lunch theorems.


Exactly, it's completely arbitrary. I like to frame it in terms of fitting n points (the data given in the problem) with a polynomial of degree n+1: there are infinitely many ways to pick a solution that still satisfies the initial data.
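
A minimal sketch of that underdetermination in Python with numpy (the particular polynomials are illustrative, not canonical): fit the five given pairs exactly, then add any multiple of a polynomial that vanishes at every data point. Both fits agree on all the data yet disagree at 9.

  import numpy as np

  xs = np.array([0, 5, 6, 7, 8], dtype=float)
  ys = np.array([1, 0, 1, 0, 2], dtype=float)

  # The unique interpolating polynomial of degree 4 through the 5 points.
  p = np.polyfit(xs, ys, 4)

  def q(x, c=1.0):
      # c*(x-0)(x-5)(x-6)(x-7)(x-8) vanishes at every data point, so q
      # also fits the data exactly, yet q(9) can be made anything at all.
      return np.polyval(p, x) + c * np.prod([x - xi for xi in xs])

  print(np.polyval(p, 9.0))  # one continuation consistent with the data
  print(q(9.0))              # another, equally consistent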

Maybe the "solution with the lowest Kolmogorov complexity".

In a sibling comment, I replied that a repeating pattern can usually be applied as well, but that rule requires storing the whole n-term sequence, which rarely makes it the shortest encodable rule.


This is exactly what I hated about these kinds of questions!!

Or the pattern simply repeats for the next n numbers... come on, there is no objective metric for which rule is "better".


Well, I had to try it, just for the fun of it. The immediate reply was the following:

[...]

Starting with 0 paired with 1. If I think about functions or some formula that could map these values, but that might be too complicated.

Alternatively, perhaps these numbers represent something specific. For example, maybe they're representing the number of certain features in the number itself, like the number of closed loops in the digit's shape.

Let's think about that:

0 has one closed loop. 5 has no closed loops. 6 has one closed loop. 7 has no closed loops. 8 has two closed loops. 9 has one closed loop. Wait a minute, that seems to match the sequence:

0 (1 loop) - 1

5 (0 loops) - 0

6 (1 loop) - 1

7 (0 loops) - 0

8 (2 loops) - 2

So, if 9 has one closed loop, then it should be paired with 1.

But let's double-check this theory with the given pairs.

0 has one loop, paired with 1 – matches.

5 has no loops, paired with 0 – matches.

6 has one loop, paired with 1 – matches.

7 has no loops, paired with 0 – matches.

8 has two loops, paired with 2 – matches.

Yes, this seems consistent.

Therefore, 9 should have one closed loop, so it should be paired with 1.

[...]

Then it went on to try another approach, based on the number of angles in the digit's shape, but discarded it. Then: "Alternatively, perhaps it's related to the number of segments that are 'on' in a seven-segment display." And so on.


A possible lesson to infer from this example of human cognition would be that an LLM that can't solve the strawberry test is not automatically less cognitively capable than another intelligent entity (humans by default).

An extension of the idea: many similar tests that try to measure and/or evaluate machine cognition may, when the LLM fails, be measuring nothing more than a specific edge case in which machine cognition fails (i.e., for the specific LLM / AI system being evaluated).

Maybe the models are actually more intelligent than they seem, like an adult failing to count the circles in the graphical images of the numbers in the problem above.


Damn, I guessed the answer to be 9...

88880 = 9

I don't know if this is being done already, but couldn't we add some training data to teach the LLM how to spell? We also teach kids what each letter means and how they combine into words. Maybe we can do this with tokens as well? E.g.:

Token 145 (ar) = Token 236 (a) + Token 976 (r)

Repeat many times with different combinations and different words?
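
A rough sketch of how such data could be generated mechanically, using the tiktoken library (the token IDs in the example above are invented; real IDs are whatever the encoding assigns, and whether this actually helps training is an open question):

  import tiktoken

  enc = tiktoken.get_encoding("cl100k_base")

  def spelling_fact(word):
      # "Token(s) for the word = token(s) for each letter", as raw IDs.
      whole = enc.encode(word)
      parts = " + ".join(f"Token {enc.encode(ch)} ({ch})" for ch in word)
      return f"Token {whole} ({word}) = {parts}"

  print(spelling_fact("ar"))

Looping this over the vocabulary would give the many repetitions suggested above, though nothing here guarantees a model would generalize from such lines.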


> but couldn't we add some training data to teach the LLM how to spell?

Sure, but then we would lose a benchmark to measure progress of emergent behavior.

The goal is not to add one capability at a time by hand - because this doesn’t scale and we would never finish. The goal is that it picks up new capabilities automatically, all on its own.


Training data is already provided by humans and certainly already includes spelling instruction, which the model is blind to because of forced tokenization. Tokenizing on words is already an arbitrary capability added by hand, one at a time; it's just the wrong one. LLMs should be tokenizing by letter, but they don't, because they aren't good enough yet, so they get a massive deus ex machina (human ex machina?) of word-ish tokenization.

I think it's a "just so story" myth that children are better than adults at that task.


