An update to Google’s privacy policy suggests that the entire public internet is fair game for its AI projects.

  • geekworking@lemmy.world
    English
    arrow-up
    49
    arrow-down
    3
    ·
    1 year ago

    Isn’t crawling and scraping content what Google and every other search engine has been doing since day one?

    • kenyard@lemmy.world
      English
      arrow-up
      5
      arrow-down
      2
      ·
      1 year ago

      I’m guessing previously they indexed the data and didn’t actually use it. Nowadays they’re using it directly themselves. You could argue previously they were making money from Google Ads, but Google was always ad-free, I believe. Results had sponsors for sure, but that wasn’t linked to your data.

      • Cordoro@lemmy.world
        English
        arrow-up
        8
        ·
        1 year ago

        Indexing is using it. They’d use the content of the page to decide how best to index it for search.

  • Sterile_Technique@lemmy.world
    English
    arrow-up
    35
    ·
    1 year ago

    The most realistic part of the Marvel movies was Ultron dipping into the internet for 4 seconds, then deciding humans need to be extinct.

    • Gsus4@lemmy.one
      English
      arrow-up
      6
      ·
      edit-2
      1 year ago

      “Don’t Look Up” also has those “and this is why God sent the flood” (aka what is this shit, start over!) vibes.

  • Gsus4@lemmy.one
    English
    arrow-up
    28
    arrow-down
    1
    ·
    1 year ago

    I don’t see why this is a problem (apart from supposedly private data like email). It’s not just Google that can do this: all this data is available to everyone who can use it to their benefit. If you want to make Google pay for a publicly available good, tax them accordingly. That’s the point of taxes: if you are successful enough to take advantage in any way of a country’s public roads, education system, access to a labour market and a functioning society generally, then taxing the massive profits from using that system is fair. Enclosing everything and holding access to the content we contributed hostage is not.

    • Dr. Moose@lemmy.world
      English
      arrow-up
      11
      arrow-down
      1
      ·
      1 year ago

      Yeah public data is public. If anyone doesn’t want their shitty comments or whatever to be used for AI training then put it behind a login or something.

      • CIA_chatbot@lemmy.world
        English
        arrow-up
        3
        arrow-down
        2
        ·
        1 year ago

        Except that’s not true: public posting of content does not trump copyright protection. Google using content for AI purposes is almost certainly a copyright issue. I may post content for human consumption, but that does not mean I allow it to be used by a private corporation for profit.

        • Dr. Moose@lemmy.world
          English
          arrow-up
          3
          arrow-down
          2
          ·
          1 year ago

          Can we please not empower copyright to such a silly extent? Copyright is already utter garbage, and some want to extend it to tweets, comments and whatnot.

          Also, AI is copying the same way we copy everything: by learning. So we shouldn’t be allowed to quote and refer to stuff we learn about online? In no way does this argument make sense. Down with copyright.

          • CIA_chatbot@lemmy.world
            English
            arrow-up
            4
            arrow-down
            2
            ·
            1 year ago

            That’s not empowering copyright. That’s literally how it works. Copyright is automatic, and if you do not have a prior agreement assigning copyright it is awarded to the person who created said content, be it a tweet, blog post, etc.

            If I make a blog post and Google scrapes the data and uses that data for profit, that’s copyright infringement, unless they can prove fair use, which has narrow definitions that training an AI for profit purposes definitely doesn’t fall under.

            • Dr. Moose@lemmy.world
              English
              arrow-up
              1
              arrow-down
              1
              ·
              1 year ago

              I dread the reality you’re describing, where every bit of information is proprietary. I think the world is a better place with free information. What you’re describing sounds a whole lot like throwing the baby out with the bath water, just because big tech corporations are “bad”.

              • CIA_chatbot@lemmy.world
                English
                arrow-up
                1
                ·
                1 year ago

                I mean you can dread it all you want, because that is LITERALLY how it works today. Google, OpenAI and Microsoft already have multiple lawsuits for stealing people’s copyrights to train their LLMs.

                Copyright is assigned automatically. If I make a blog post, that is automatically my copyrighted material. As the creator, I get to choose how it’s used, not Google.

                If I took some proprietary Google code and used it without permission you know damn well they would sue my ass into oblivion. Copyright has to protect the small as well as the giant.

                • Dr. Moose@lemmy.world
                  English
                  arrow-up
                  2
                  arrow-down
                  1
                  ·
                  edit-2
                  1 year ago

                  I don’t think you understand.

                  Let’s imagine everything is copyrighted. Who will be able to create LLMs now? Google/Meta, who can afford to literally hire thousands of people on below minimum wage to create annotations, or smaller companies and free projects? You are literally empowering the thing you’re complaining about.

                  Public data is public, and that’s good for general balance. It removes the moats.

    • boonhet@lemm.ee
      English
      arrow-up
      4
      ·
      1 year ago

      If you want to make Google pay for a publicly available good, tax them accordingly.

      Tax them where? In the US? But a lot of the content they scrape would be European. So does the EU get to tax them for content scraped from EU users and the US for content scraped from US users? Actually, how DO we define the locality of online content? By host server? By the legal location of the company or person that owns the site? By the content poster’s location?

      Much as I’d love to see Google pay more taxes, I’m not sure how this would play out.

        • Gsus4@lemmy.one
          English
          arrow-up
          1
          ·
          edit-2
          1 year ago

          It did not come out of nowhere, it’s right in there: I mentioned taxes because using a public good/service is only freeloading (like people imply with google scraping public data or Elon when he talks about data pillaging) if you don’t pay for its upkeep.

    • SamB@lemmy.world
      English
      arrow-up
      2
      ·
      1 year ago

      As long as they don’t present that data as their own, I am fine with it. But wait, that’s exactly what they’re doing… I have a vision of a thousand lawsuits shoved down the throat of the mighty Alphabet.

  • renrenPDX@lemmy.world
    English
    arrow-up
    25
    ·
    1 year ago

    Why is AI scraping not respecting robots.txt? It wasn’t OK in the early internet days, so why is it OK now? People are complaining about being overloaded by scrapers like it’s the ’90s.

      • sudo@lemmy.fmhy.ml
        English
        arrow-up
        13
        ·
        edit-2
        1 year ago

        Here’s an example https://www.google.com/robots.txt

        Basically it’s a file people put in their root directory of their domain to tell automated web crawlers what sections of the website and what kind of web crawlers are allowed to access their resources.

        It isn’t a legally binding thing, more of a courtesy. Some sites may block traffic if they’re detecting the prohibited actions, so it gives your crawlers an idea of what’s okay in order to not get blocked.
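
        As a rough sketch of that courtesy check, here is how a well-behaved crawler might consult robots.txt before fetching a page, using Python’s standard `urllib.robotparser`. The robots.txt content, the bot names (`ExampleAIBot`, `SomeSearchBot`) and the example.com URLs are all made up for illustration, and nothing here stops a crawler that simply skips the check.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this example.
# It blocks one named bot entirely and keeps every other bot out of /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: ExampleAIBot
Disallow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the rules allow this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)

# A generic crawler falls under the "*" rules.
print(may_fetch(parser, "SomeSearchBot", "https://example.com/blog/post"))
print(may_fetch(parser, "SomeSearchBot", "https://example.com/private/notes"))

# The named bot has its own, stricter entry.
print(may_fetch(parser, "ExampleAIBot", "https://example.com/blog/post"))
```

        A real crawler would download the file from the site’s root (for example with `RobotFileParser.set_url()` and `read()`) instead of parsing a string, but the decision to honour the answer is still entirely up to the crawler.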

      • renrenPDX@lemmy.world
        English
        arrow-up
        3
        ·
        1 year ago

        It’s a plain text file hosted on your site that should be visible to the internet. It basically tells search engines which parts of your site they are allowed or not allowed to scrape.

  • TWeaK@lemm.ee
    English
    arrow-up
    22
    ·
    1 year ago

    I see a problem with it, just like there is a problem with all their data collection. They are taking our data with neither consideration nor compensation, and using it for their profitable commercial enterprise. They should be paying us for that data.

    You can’t build a car without paying for the nuts and bolts. Yet that is exactly what they’ve been doing, and they’ve become filthy rich doing it, at the expense of every one of us.

  • Deref@kbin.social
    arrow-up
    15
    arrow-down
    1
    ·
    1 year ago

    Don’t see a problem with it as long as they don’t get copyright on the outputs of their AI. That would make enforcing any IP impossible on the internet because there’s no way to prove it wasn’t AI generated.

    • Deceptichum@kbin.social
      arrow-up
      8
      arrow-down
      3
      ·
      1 year ago

      So that’s a win-win?

      Google uses free content we’ve all put out there open for everyone and we lose an archaic idea of private ownership.

  • InverseParallax@lemmy.world
    English
    arrow-up
    11
    arrow-down
    1
    ·
    1 year ago

    I believe them…

    This is why I’ve never trusted google.

    They’re very nice people, but everybody is decent until they need to hit quarterly earnings.

  • ihavenopeopleskills@kbin.social
    arrow-up
    3
    ·
    1 year ago

    Not saying it’s a comforting thought but that’s one of several reasons why one doesn’t post anything online if they aren’t comfortable with outcomes such as this.