(Go stick your head in a pig!)

Come to think of it, “share and enjoy” is exactly the way I would expect an AI-generated YouTube video to end.

  • Catoblepas@lemmy.blahaj.zone · English · 7 upvotes · 19 days ago

    The second study link you gave didn’t find that it was ‘better’ than human writers; it concluded that if you do a lot of fine tuning then it can summarize news stories in a way that six people (marginally better than n=1 anecdote in the first link, I guess?) rated on par with Amazon mturk freelance writers. And they also noted that this preference for how the LLM summarized was individual, as in blind tests some of them still just disliked it. There are leagues and leagues of room between that and “summarizes better than humans.”

    You and I both know that 99.9% of people are not fine tuning LLMs that way when they ask for a summary, which means almost nobody is going to be getting that ‘kind of as good as a person’s summary maybe if you like that style of summarizing.’ They’re getting the predictive text slop. Like, good for you if you aren’t, but maybe you should be a bit more upfront about how little you trust it and how much work you have to do to get it to give you an accurate (maybe?) summary?

    My problem with LLMs is that it is fundamentally magic-brained to trust something without the power to reason to evaluate whether or not it’s feeding you absolute horseshit. With a human being editing Wikipedia, you trust the community of other volunteers who are knowledgeable in their field to notice if someone puts something insanely wrong in a Wikipedia article. An LLM will tell you anything and phrase it with enough confidence that someone with no expertise on a subject won’t know the difference.

    • Ragdoll X@lemmy.world · English · 1 upvote · edited · 18 days ago

      There are several problems with these arguments.

      > […] it concluded that if you do a lot of fine tuning then it can summarize news stories in a way that six people (marginally better than n=1 anecdote in the first link, I guess?) rated on par with Amazon mturk freelance writers. […]

      It concluded quite the opposite actually. From the “Instruction Tuned Models Have Strong Summarization Ability.” section:

      Across the two datasets and three aspects, we find that the zero-shot instruction-tuned GPT-3 models, especially Instruct Curie and Davinci, perform the best overall. Compared to the fine-tuned LMs (e.g., Pegasus), Instruct Davinci achieves higher coherence and relevance scores (4.15 vs. 3.93 and 4.60 vs. 4.40) on CNN and higher faithfulness and relevance scores (0.97 vs. 0.57 and 4.28 vs. 3.85) on XSUM, which is consistent with recent work (Goyal et al., 2022).

      You might be confusing instruction tuning with fine-tuning for text summarization. Instruction tuning involves rewarding a model based on the helpfulness of its responses in a user-assistant setting, and it has been the industry standard ever since the first ChatGPT showed its effectiveness.

      Also they actually recruited thirty evaluators from MTurk, and six writers from Upwork (See “Human Evaluation Protocol” and “Writer Recruitment”).

      Their conclusions are also consistent with the study you linked to, in which they fine-tuned the Mistral and Llama models in an attempt to generate better summaries, but the evaluators still rated them lower than the human summaries. Though I’m not sure that you will be convinced by this study either since, as they state in the “PHASE 3 – FINAL ASSESSMENT” section:

      ASIC engaged five business representatives (EL2 level staff across two business teams) to assess both the human and AI generated summaries. Each assessor was assigned one submission to read and rate the two associated summaries - labelled A and B.

      Even putting all of this aside, you can actually use custom ChatGPTs that have been fine-tuned specifically to write summaries and test them for yourself if you want:

      > And they also noted that this preference for how the LLM summarized was individual, as in blind tests some of them still just disliked it. There are leagues and leagues of room between that and “summarizes better than humans.”

      The exact same thing can be said about the lower scores in the study you linked to, so what is the exact threshold? Would you only trust an AI to summarize things if 100% of humans liked it? Besides, even if you think the best model in the study was still not good enough, there are other, even better models that have been published since then, like the ones at the top of the aforementioned leaderboards, and others like GPT-4o, OpenAI o1 and OpenAI o3.

      > An LLM will tell you anything and phrase it with enough confidence that someone with no expertise on a subject won’t know the difference.

      That’s why I linked to the first article where they specifically asked an actual lawyer to evaluate summaries of legal texts written by LLMs and interns - and as we can see he thought the AI was better.

      > My problem with LLMs is that it is fundamentally magic-brained to trust something without the power to reason to evaluate whether or not it’s feeding you absolute horseshit. With a human being editing Wikipedia, you trust the community of other volunteers who are knowledgeable in their field to notice if someone puts something insanely wrong in a Wikipedia article.

      Whenever I’ve gotten into debates about the philosophy of AI and its relation to things like art, reason and consciousness, the arguments I’ve seen always end up being rather inconsistent and condescending, so I’m not even going to get into that. However, I will point out that if we take the general definition of reasoning to mean “drawing logical conclusions through inference and extrapolation based on evidence”, the Wikipedia pages on LLMs, OpenAI o3 and “commonsense reasoning” explicitly describe AIs as reasoning. You’re welcome to disagree with this assessment, but if you do, I hope we can then agree that, as I stated previously, Wikipedia contributors and their sources aren’t always reliable.

      But sure, let’s put that aside and assume that reasoning is a magical aspect of the human brain that inherently excludes AI, so LLMs simply can’t reason… So what?

      AlphaFold can’t reason, but it can still predict the structure of proteins better than humans, so it would be naive not to use it simply because it doesn’t reason. In the same vein, even if you want to conclude that LLMs can’t reason, this doesn’t change the fact that they are useful tools that perform as well as or even better than humans in many tasks, including summarizing text.

      LLMs are, for all intents and purposes, just really complicated functions that model some data distribution we give to them. Language obviously has a predictable distribution since we don’t speak/write randomly, so given proper data and training there’s no reason to believe that an AI can’t model that even better than humans. Hell, we don’t even need to get so conceptual and broad with these arguments; we can just look at the quantitative results of these models, and assess their usefulness ourselves by simply using them.
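
      To make the “predictable distribution” point concrete, here is a minimal sketch of a toy bigram model in Python. It is nothing like a real LLM in scale or architecture; the corpus and the function names `train_bigram` and `next_word_probs` are made up purely for illustration. It only shows that counting which word follows which already gives a usable next-word distribution:

```python
from collections import Counter, defaultdict


def train_bigram(text):
    """Count, for each word, how often each next word follows it."""
    counts = defaultdict(Counter)
    words = text.lower().split()
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts


def next_word_probs(counts, word):
    """Turn raw follow-up counts into a probability distribution."""
    followers = counts.get(word.lower(), Counter())
    total = sum(followers.values())
    if total == 0:
        return {}  # word never seen with a follower
    return {w: c / total for w, c in followers.items()}


# Hypothetical toy corpus, chosen only to keep the arithmetic easy to check.
corpus = "the cat sat on the mat the cat ate the fish"
model = train_bigram(corpus)
probs = next_word_probs(model, "the")
print(probs)  # {'cat': 0.5, 'mat': 0.25, 'fish': 0.25}
```

      After “the”, the toy model predicts “cat” half the time because that is what the data looks like. Real models replace the counting table with a learned function over far more context, but the goal of modelling the distribution is the same.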

      Again, I don’t trust everything these AIs generate, there are things for which I don’t use them, and even when I do sometimes I just don’t like their answers. But I see no reason to believe that they are inherently more harmful than humans when it comes to the information they generate, or that they’re dangerously inaccurate even in their current state. If nothing else, I can just ask one to summarize a Wikipedia page for me and be confident that it’ll be accurate in doing so - though as the links I mentioned demonstrate, and as you may have come to believe after considering Wikipedia’s assessment of AI reasoning, the Wikipedia contributors and their sources aren’t 100% reliable.

      We both know that humans fall for and say absolute horseshit. Heck, your comment is a good example of this, where you moved the goalposts again, failed to address or outright ignored many of my arguments, and didn’t properly engage with any of the sources cited, even your own.

      If you just dislike AI on principle because this technology inherently bothers you, that’s fine; you’re entitled to your opinion. But let’s not pretend that this is because of some logical or quantifiable metric that proves that AI is so dangerous or bad that it can’t be used even to help university students with some basic tasks.