Sorry for the alarming title, but admins, for real: go set up Anubis.
For context, Anubis is essentially a gatekeeper/rate limiter for small services. From their description:
(Anubis) is designed to help protect the small internet from the endless storm of requests that flood in from AI companies. Anubis is as lightweight as possible to ensure that everyone can afford to protect the communities closest to them.
It puts forward a challenge that must be solved in order to gain access, and judges how trustworthy a connection is. The vast majority of real users will never notice it, or will only notice a small delay the first time they access your site. Even smaller scrapers may get by relatively easily.
Big scrapers though, the AI crawlers and model trainers, get hit with computational problems that waste their compute before they're let in. (Trust me, I worked for a company that did "scrape the internet", and compute is expensive and a constant worry for them, so it's a win-win for us!)
Anubis ended up taking maybe 10 minutes to set up. For Lemmy hosts you literally just point your UI proxy at Anubis and point Anubis at lemmy-ui. Very easy, it slots right in with minimal setup.
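Roughly, the proxy change looks something like this in nginx. The container names and ports here (anubis:8923, lemmy-ui:1234) are placeholders rather than my actual config; check the Anubis docs for its BIND/TARGET settings and adapt to whatever your setup uses:

```nginx
# Before: the UI vhost proxied straight to lemmy-ui.
# After: it proxies to Anubis, and Anubis forwards passing requests on.
server {
    listen 443 ssl;
    server_name example.instance;

    location / {
        # Anubis sits in front and issues the proof-of-work challenge
        proxy_pass http://anubis:8923;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }
}

# Anubis itself is configured (via its own env vars/flags) to forward
# challenged-and-passed traffic to http://lemmy-ui:1234/.
```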

These graphs cover the period since I turned it on less than an hour ago. I have a small instance, only a few people, and my CPU usage and requests per minute immediately went down. Thousands of requests have already been challenged; I had no idea I was being scraped this much! You can see them backing off in the charts.
(FYI, this only stops the web requests, so it does nothing to the API or federation. Those are proxied elsewhere, so it really does only target web scrapers).
I created a honeypot that is only accessible if they click the "don't click this unless you are a bot" link. If they click it 3 times, poof, the IP gets banned for a day. It's worked well.
Simple little Flask app. robots.txt as well, but only Google seems to actually read that and respect it.
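Something along these lines would do it (a minimal sketch: the route path, threshold, and in-memory ban store are just illustrative, and a real setup would persist the bans and feed them to the proxy or firewall):

```python
# Minimal honeypot sketch: a trap URL that only bots should ever hit.
# Three hits from the same IP bans it for a day.
import time
from collections import defaultdict
from flask import Flask, request, abort

app = Flask(__name__)
hits = defaultdict(list)   # ip -> list of hit timestamps
banned = {}                # ip -> ban expiry (unix time)
BAN_SECONDS = 24 * 60 * 60

@app.before_request
def reject_banned():
    ip = request.remote_addr
    if banned.get(ip, 0) > time.time():
        abort(403)

@app.route("/dont-click-this-unless-you-are-a-bot")
def honeypot():
    ip = request.remote_addr
    hits[ip].append(time.time())
    if len(hits[ip]) >= 3:
        banned[ip] = time.time() + BAN_SECONDS
        abort(403)
    return "nothing to see here"

if __name__ == "__main__":
    app.run(host="127.0.0.1", port=5000)
```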
Good idea, I’ll add something similar to PieFed.
Before banning them you should send them a zip bomb: https://ache.one/notes/html_zip_bomb
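The trick in that post is to pre-compress a huge, repetitive payload and serve it with Content-Encoding: gzip, so the scraper wastes memory inflating it. A rough sketch of generating such a payload (sizes and filename are arbitrary):

```python
# Sketch: write a small .gz file that expands to ~1 GiB of zero bytes.
# Serve it to flagged bots with "Content-Encoding: gzip" and a text/html
# content type so their client tries to decompress the whole thing.
import gzip

CHUNK = 1024 * 1024                 # 1 MiB of zeros per write
TARGET = 1024 * 1024 * 1024         # ~1 GiB decompressed

with gzip.open("bomb.html.gz", "wb", compresslevel=9) as gz:
    zeros = b"\0" * CHUNK
    for _ in range(TARGET // CHUNK):
        gz.write(zeros)
```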
I've noticed you can ask the most basic thing in the registration form and bots will just ignore it.
Lmao genius, until crawlers are LLMs or similar
I fail to see how prediction engines can do anything different.
LLMs are extremely compute-expensive. They will never be used for large-scale web scraping.
Not necessarily for scraping, but large bot account farms do use LLMs to parse text and figure out the important parts of a site to interact with. Usually they run cheap, RAM-only LLMs that don't use many resources (300 MB to 1 GB of RAM).
Yes, please use Anubis instead of a captcha. I constantly use Tor and a VPN; captchas take up a lot of time and I have to click multiple times, but Anubis takes <1 sec and I don't have to do anything.
What about folks on low spec Android phones? Or folks who browse with JavaScript off? Every solution to block AI will block some percentage of humans.
As I mentioned, they are held up for a few seconds, once. After the trust is established a cookie is set and they pass through freely. My instance has actually been more responsive because the bot traffic is gone.
And if their browser doesn’t store cookies?
As a user who uses a VPN and doesn’t store cookies, I have to wait up to a minute basically every time I visit one of these sites. But I get it. I blame the bots, not the admins.
If you choose to disable core features of your browser then yes, you will have reduced functionality. That is a tradeoff you have made.
And what about RSS? Favicons? OAuth? robots.txt? There are lots and lots of things that need to be accessed by automated programs without user intervention. It is not trivial to determine what these things might be. For your personal instance, go nuts. But no public instance should be doing this.
Those work fine with Anubis.
Anubis is fairly stupid in reality. It only challenges a request at all if it looks like it comes from a regular browser (and thus catches the scrapers that pretend to be regular browsers to hide in normal traffic). If you use an RSS reader, for example, that doesn't hide the fact that it is an RSS reader, then Anubis will send it right through.
Good to know. But most RSS readers already pretend to be browsers, because otherwise many publications with misconfigured reverse proxies will block them from accessing the RSS feed; cbc.ca is a good example of this. Deploying a web firewall is neither easy nor trivial unless you know exactly who needs to access what, when, and why. Most people, in my experience, do not.
All of those work with Anubis, and if you didn't want them to go through Anubis it would be trivial to have them bypass it with one line of proxy config.
In brief testing I get challenged when trying to load robots.txt on hosts running Anubis. I also see reports of it blocking OAuth flows and access to stuff like .well-known.
For .well-known, I think I would bypass Anubis completely: set that as a separate block in the proxy and just continue on to whatever backend app it needs. OAuth… yeah, I could see it having issues with the callback in OAuth, if it started thinking the callback endpoint was a bot. You could fix that by not using Anubis for that endpoint. In nginx, that's just a location block like:
location = /oauth/callback { proxy_pass http://lemmy-ui:1234/; }
while the other routes go through the standard routing. Admittedly a bit more setup.
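The .well-known bypass would look much the same; something along these lines (backend names and ports are placeholders), with everything else falling through to Anubis:

```nginx
# Hypothetical routing: bypass Anubis for machine-to-machine paths,
# send everything else through it.
location ^~ /.well-known/  { proxy_pass http://lemmy-ui:1234; }
location = /robots.txt     { proxy_pass http://lemmy-ui:1234; }
location = /oauth/callback { proxy_pass http://lemmy-ui:1234; }

location / {
    proxy_pass http://anubis:8923;   # Anubis forwards passing traffic to lemmy-ui
}
```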
This is exactly my point, though. Yes, all of these changes are easy and possible. But you have to know about them first. This is not a drop-in "protect everything without side effects" tool like your initial post seems to say.

For every app you put behind it, you need to take time to think over exactly what access is required, by whom, when, and how. Does it use OAuth, RSS, .well-known, XML-RPC/pingbacks, RDF/SPARQL endpoints, etc.? Do some robots need to be allowed (for federation, discoverability, automated healthchecks, etc.)? Are there consumers of APIs provided by the app? Will file downloads come from downloaders that resume or open multiple chunked connections at once?

What is the profile of the humans you expect to be accessing the service: are they using terminal browsers like lynx, do they disable JavaScript and/or cookies for privacy, are they on a VPN, are they on low-powered devices like Raspberry Pis or low-end Android tablets, etc.? And what bots are you intending to block, and how do they behave: they may just be running headless Chrome and pass all your checks, or they may be on zombie consumer machines that are part of a botnet.

As with anything in life, there are no magical shortcuts, and no way to say "block all the bad people I don't like and allow the good people in" without first defining who the good people are and what you don't like.
In your case, all you've effectively done is say "good people run JavaScript and allow cookies, bad people do not", without really thinking through the implications of that. I suspect what you really mean is "I don't need or want anyone but me accessing my personal Lemmy instance". So why not block lemmy-ui from every country but your own, or even restrict it to subnets belonging to the ISPs you use? That would seem to be a lot easier in the case of a personal instance. In the case of a public instance like mine, though, the problem is much harder.
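The crude version of that in nginx would just be an allow/deny list on the UI vhost, something like this (the subnets are documentation examples, not real ISP ranges):

```nginx
# Hypothetical: only allow the UI from a couple of known subnets.
location / {
    allow 203.0.113.0/24;    # e.g. your home ISP range
    allow 198.51.100.0/24;   # e.g. your mobile carrier range
    deny  all;
    proxy_pass http://lemmy-ui:1234;
}
```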
Wow, alright, I am not here to argue. Feel free to continue arguing without me. I found a tool that quite literally was a drop-in addition for me; I had maybe one line of configuration to set it up. Silly me, I thought "man, more admins need to know about this, hopefully they can save some money and prevent some bot traffic".
Little did I know that doing something like sharing suggestions that helped me would be such an angering thing. God damn, every goddamn time I post, someone's gotta throw a thousand goddamn pedantic things at me. FFS, yes, I know there are probably gotchas, and above I even added them. I've been trying to help people here and in the other places I crossposted this, so we could see how it would work for them. If people wanted help debugging their configurations I was ready to help, offering configuration and solutions. But no, that's too goddamn much here.
I mean, what do I fucking know, okay? I've only run apps like this for over a decade, both for myself with services here and professionally, from IIS to Kubernetes, from startups to big tech; I mean, my job title is only senior SRE. But yes, thank you for showing me the error of my ways. I will keep the things that have helped me to myself now, because posting here has been exhausting.
They just released a big new version.
We have been running it for a year or so, but lately there seem to be some scrapers that get around it, probably by using a third-party web frontend and thus hitting the API endpoints instead. But still better than nothing, I guess.
Maybe that's why some people think it's difficult, but I didn't find it to be. For me it's blocking 90% of the traffic, which for a personal instance I know is accurate, so I'm ecstatic.
It's not trivial. I remember at least two instances of a biggish Lemmy server having problems with images and federation because of Anubis.
I mean, it was trivial for me. I run in Kubernetes, with a test environment using Docker Compose, and for both of them I spun up the extra container, tested it, and then just swapped the Lemmy proxy over to use Anubis first. To me that's trivial, but YMMV.









