Extracting the Source of Apache Web Requests

I had a query come in recently from one of our users that told me our website was running slowly. We have quite a beefy cloud-based VM so I knew that it had enough oomph, so what was the cause of this? To begin my hunt, I headed over to the Apache logs which would help me find the source of any web requests. This you can (usually) find at:

Not the type of logs you should be inspecting. Photo by Khari Hayden from Pexels

At this point, it can all depend on how you set up you sites in Apache, but for our purposes, let’s assume my file is called: access.log

Log File Contents

If we take a look inside, this is the sort of thing you can expect to see:

To make it easier to discuss, I have numbered the various columns; most are obvious, but let’s take a look at each in turn.

  1. The IP address. This one is the localhost, but it could be anything.
  2. The identity of the client – often unknown so marked as a hyphen in our system.
  3. The user which made the request – again, unknown often for us.
  4. The time and date the request was made.
  5. The resource referenced.
  6. The type of request method – most will be GETs.
  7. The protocol used.
  8. The status code – in this example 200 means OK and,
  9. The size of the request in bytes.

For now, I was just interested in who was making lots of requests because it might have been someone deliberately hitting the site (DOS style) or someone being a bit mean and scraping the contents too anti-socially.

Extracting the IP Address

Let’s pull out the IP address which sits in column one, but I’ll only work with the first 6 entries to keep things small. Oh, needless to say, these are completely fake IP addresses!

Counting Occurences

OK, that’s a good start, but what would be more useful is if I could see how many times each IP address appeared. Let’s start using some in-built UNIX commands to achieve that.

Much better. To get the effect I wanted, I have sorted the IPs (sort), removed duplicates but counted the total at the same time (uniq -c) and then numerically ordered the results (sort -n) so that the biggest offenders are at the bottom.

Looking Up Hostnames

That really helps – I now know that the machine at 123.12.11.133 is being completely outrageous and making 3 requests to my website. Goodness gravy! But who is that person or entity? I need to do some digging.

Ah – so madeupname.com is the culprit. That’s much easier to talk about and gives me a clue about why they might be hammering my site with not one, not two but a ridiculous three requests.

Combining the Commands in a Script

Now let’s combine the two commands so that my output includes a count of requests, the IP address AND the hostname it refers to. For this, I am going to use a BASH script so type the following into a file called ips.sh and save it.

Don’t forget to set the executable flag or you won’t be able to run it.

Looking Under the Hood

To explain how this works, I’m going to start at the top and work my way down for this.

On lines 3 and 4, I initialise IFS, which is the internal field separator. By entering the command above, I am telling BASH not to chop up things based on spaces, which is the default, but to treat tokens as whole lines. That will make more sense in a minute.

Next we run our command to grab the IP addresses and place them in a variable called command. I’ve made a couple of changes here, too. I’m now reading from the first argument of the script on the command line which means we run it like this:

but also, I am no longer just reading the first 6 lines of the file – I am using ALL of it, which can amount to tens of thousands of lines. Once done, though, I only want the top 15 abusers – hence the tail -15 at the end.

Lines 7, 8 and 12 handle the looping over the results stored in command. Remember we set the IFS to use whole lines? Well, without that, we would be working on space delimited words on each line, rather than the whole line. Ultimately, I needed to treat each line as one entity.

OK, onto the meat of the script.

Line 9 uses awk again to yank out the IP address from column 2. Bear in mind, our first awk command produced output like this:

where 100 was the count of IP instances and the second part was the actual IP address; that’s why  we take $2, the second column.

Line 10 looks up the hostname using that IP address and dig.

Lastly, line 11 puts it all back together and outputs it for your perusal, and there you go – one quick and easy script to work out who is hitting your website the most.

Now all I need is another script called: stophittingmywebsite.sh.

twittergoogle_plusredditpinteresttumblrmail

Written by Stephen Moon
email: stephen at logicalmoon.com
www: https://www.logicalmoon.com


Hey! Did you enjoy reading this? If you did and would like an email when I add new content, just subscribe to my list. You can unsubscribe at any time.


Leave a Reply

Your email address will not be published. Required fields are marked *