Sorry for the alarming title, but admins, for real: go set up Anubis.

For context, Anubis is essentially a gatekeeper/rate limiter for small services. From them:

(Anubis) is designed to help protect the small internet from the endless storm of requests that flood in from AI companies. Anubis is as lightweight as possible to ensure that everyone can afford to protect the communities closest to them.

It puts forward a challenge that must be solved in order to gain access, and judges how trustworthy a connection is. The vast majority of real users will never notice it, or will notice only a small delay the first time they access your site. Even smaller scrapers may get by relatively easily.

Big scrapers, though (AI companies and trainers), get hit with computational problems that waste their compute before they are let in. (Trust me, I worked for a company that did “scrape the internet”, and compute is expensive and a constant worry for them, so it’s a win-win for us!)
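To give a rough idea of what “a challenge that must be solved” can look like, here is a minimal sketch of a generic hash-based proof of work in Python. It is not Anubis’s actual code: the function names, the challenge string, and the difficulty value are all made up for illustration. The point is only that solving takes many hash attempts while checking takes one.

```python
import hashlib
import itertools


def solve_challenge(challenge: str, difficulty: int) -> int:
    """Client side: brute-force the first nonce whose SHA-256 hex digest
    starts with `difficulty` zero characters. Cost grows ~16x per step."""
    prefix = "0" * difficulty
    for nonce in itertools.count():
        digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
        if digest.startswith(prefix):
            return nonce


def verify_solution(challenge: str, nonce: int, difficulty: int) -> bool:
    """Server side: a single hash is enough to check the client's work."""
    digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
    return digest.startswith("0" * difficulty)


if __name__ == "__main__":
    # Hypothetical challenge string and difficulty, chosen only for the demo.
    nonce = solve_challenge("example-challenge", 3)
    print(nonce, verify_solution("example-challenge", nonce, 3))
```

One visitor pays this cost once and barely notices. A scraper hitting millions of pages from fresh sessions pays it over and over, which is where the compute bill comes from.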

Anubis ended up taking maybe 10 minutes to set up. For Lemmy hosters you literally just point your UI proxy at Anubis and point Anubis to Lemmy UI. Very easy and slots right in, minimal setup.

These graphs cover the time since I turned it on, less than an hour ago. I have a small instance with only a few people, and my CPU usage and requests per minute both dropped immediately. I have already had thousands of requests challenged; I had no idea I was being scraped this much! You can see them backing off in the charts.

(FYI, this only stops the web requests, so it does nothing to the API or federation. Those are proxied elsewhere, so it really does only target web scrapers.)
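As a rough sketch of that split, here is the routing decision written out in Python. This is an illustration of the idea, not a real proxy config: the upstream names, the path prefixes, and the header check are all assumptions, so check them against your own reverse proxy setup before relying on them.

```python
# Hypothetical upstream names, used only for this sketch.
WEB_UPSTREAM = "anubis"     # Anubis, which then forwards to the Lemmy UI
BACKEND_UPSTREAM = "lemmy"  # Lemmy backend: API and federation traffic

# Assumed prefixes for backend-only paths; adjust to your own proxy config.
BACKEND_PREFIXES = ("/api/", "/pictrs/", "/feeds/", "/nodeinfo/", "/.well-known/")


def pick_upstream(method: str, path: str, accept: str = "") -> str:
    """Decide which upstream a request goes to, skipping Anubis for non-web traffic."""
    if path.startswith(BACKEND_PREFIXES):
        return BACKEND_UPSTREAM  # API, media, feeds, discovery
    if accept.startswith("application/"):
        return BACKEND_UPSTREAM  # e.g. federation asking for activity JSON
    if method.upper() == "POST":
        return BACKEND_UPSTREAM  # e.g. federation inbox deliveries
    return WEB_UPSTREAM          # ordinary browser page loads get challenged


if __name__ == "__main__":
    print(pick_upstream("GET", "/post/123", "text/html"))
    print(pick_upstream("GET", "/api/v3/site"))
```

Only requests that fall through to the last line ever see a challenge, so federation and API clients keep working without needing JavaScript or cookies.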

  • 🇨🇦Samuel Proulx🇨🇦@rblind.com
    1 day ago

    This is exactly my point, though. Yes, all of these changes are easy and possible. But you have to know about them first. This is not a drop-in “protect everything without side effects” tool like your initial post seems to say. For every app you put behind it, you need to take time to think over exactly what access is required by whom, when, and how. Does it use OAuth, RSS, .well-known, xmlrpc/pingbacks, RDF/SPARQL endpoints, etc.? Do some robots need to be allowed (for federation, discoverability, automated healthchecks, etc.)? Are there consumers of APIs provided by the app? Will file downloads come from downloaders that resume downloads or chunk them across multiple connections at once? What is the profile of the humans you expect to be accessing the service: are they using terminal browsers like Lynx, do they disable JavaScript and/or cookies for privacy, are they on a VPN, are they using low-powered devices like Raspberry Pis or low-end Android tablets, etc.? What bots are you intending to block and how do they behave: they may just be running headless Chrome and pass all your checks, they may be on zombie consumer machines that are part of a botnet, etc. As with anything in life, there are no magical shortcuts, and no way to say “block all the bad people I don’t like and allow the good people in” without first defining who the good people are and what you don’t like.

    In your case, all you’ve effectively done is say “good people run JavaScript and allow cookies, bad people do not”, without really thinking through the implications of that. I suspect what you really mean is “I don’t need or want anyone but me accessing my personal Lemmy instance”. So why not block lemmy-ui from every country but your own, or even restrict it to subnets belonging to the ISPs you use? That would seem to be a lot easier in the case of a personal instance. In the case of a public instance like mine, though, the problem is much harder.

    • Scrubbles@poptalk.scrubbles.techOP
      24 hours ago

      Wow, alright, I am not here to argue. Feel free to continue arguing without me. I found a tool that quite literally was a drop-in addition for me; I had maybe one line of configuration to set. Silly me, I thought “Man, more admins need to know about this, hopefully they can save some money and prevent some bot traffic”.

      Little did I know that sharing something that helped me would make people so angry. God damn, every goddamn time I post, someone’s gotta throw a thousand goddamn pedantic things at me. Ffs, yes, I know there are probably gotchas, and above I even added them. I’ve been trying to help people here and in the other places I crossposted this, so we could see how it would work for them. If people wanted help debugging their configurations, I was ready to help, offering configuration and solutions. But no, that’s too goddamn much here.

      I mean, what do I fucking know, okay? I’ve only run apps like this for over a decade, both for myself with services here and professionally, from IIS to Kubernetes, from startups to big tech. I mean, my job title is only senior SRE. But yes, thank you for showing me the error of my ways. I will keep things that have helped me to myself now, because posting here has been exhausting.