~ PHP Regex Spider: A first draft ~
by Frank Mitchell




Originally published @ searchlores in September 2004     Version 1.02, Updated in September 2004



PHP Regex Spider: A first draft

or

xi2 : object / verb


Still to do

  1. Come up with some rules for displaying code within prose.

  2. Add links to the appropriate manual pages when a function is first talked about.

  3. Get someone with a less technical background to read over this and make sure it's understandable.

  4. Provide a zipped download of the full source so people don't have to cut and paste.

Introduction

Spiders are one of those little tools in a seeker's bag of tricks that you really cannot appreciate the power and beauty of until you build (or at least tweak) one of your own. To that end, the purpose of this essay is to guide you through the evolution of a PHP spider from conception to code. My goal is that by the end you will have a working spider (whose workings you understand) that you can modify and send out into the wild web to find and retrieve your targets.

This essay is primarily directed towards those seekers who are new to PHP. It holds your hand every step of the way. If you don't want that kind of babysitting, please feel free to skip ahead to the end and the completed code.

Your bag of tools

You'll need PHP installed and running. I'm using version 5.0.1; your mileage may vary if you've got something newer or older. You'll also need a good text editor (any good one will do).

In the beginning, there was a need...

I've always wanted my own spider, but mostly I've been too lazy to either code one or learn enough PHP to understand the ones that have been coded by the wizard programmers of that castle. I am not a programmer; my background lies much more in the creation and study of human languages than it does in machine code. Still, one learns best by doing, and it wasn't until I came across a target that necessitated a spider that I decided to write this one.

The roads most traveled are the ones used by sheep. That means that quite a few of the locked sites you come across (the kind that ask for a user name and password) also provide some kind of functionality by which the user can retrieve their lost password. Lots of the time that retrieval is accomplished by answering a secret question. Because people are predictable, the answers to their secret questions are predictable as well.

Before you can guess, however, you must get a name. This is where the laziness of people becomes helpful. The kinds of people who will have easy-to-guess secret questions are also the kinds of people who would use the same user name for both a public forum and a members-only site.

It was in this sort of scenario that I found myself, and it was from here that my spider was born. Its job: harvest a list of user names from a website.

So, you want a spider huh?

Learning to read (and thus use) other people's code is something of an art form. However, authors can make it easy by providing verbose comments. If you don't like prose, feel free to skip to the end. All the source is provided there. Still here? Good, let's jump right in.


#!/usr/bin/php
<?php
    echo("\n");
    ...
?>

Those four lines are my standard openers to any PHP program. The first lets my computer know where PHP can be found. The second begins a block of PHP code, and the third prints a new line to the screen so any output I generate from here on out isn't smashed up against the command line. The ellipsis represents where a block of code has been removed to make explanations easier. The final line marks the end of a block of PHP code.


$start = "http://searchlores.org/";

This line answers the question "Where should we start searching from?" The variable $start (in PHP all variables begin with a dollar sign) is initialized and told to contain the string "http://searchlores.org/" which is the URL our spider will start searching from.


$search = "/<em>(.*?)<\/em>/";

Another question answering line. "What should we be looking for?" The variable $search now holds a Perl compatible regular expression that will match anything between a set of emphasis tags. Yes, our spider has the ability to search within code. If you're not familiar with Perl regular expressions, I'd suggest you read up on them, 'cause this spider uses them quite a bit.

Anyways, back to the spider and the code.
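Before diving in, a quick demonstration (on a made-up snippet of HTML) of why that question mark in the regex matters. The ? after .* makes the match non-greedy, so each match stops at the first closing tag instead of running on to the last one:

```php
<?php
    // A made-up snippet of HTML to search through.
    $contents = "<p>Visit <em>searchlores</em> and read <em>essays</em>.</p>";

    // Non-greedy: each match stops at the FIRST </em> it finds.
    preg_match_all("/<em>(.*?)<\/em>/", $contents, $lazy);
    print_r($lazy[1]);   // "searchlores" and "essays"

    // Greedy: .* runs on to the LAST </em>, swallowing both words
    // (and the tags in between) as one big match.
    preg_match_all("/<em>(.*)<\/em>/", $contents, $greedy);
    print_r($greedy[1]); // "searchlores</em> and read <em>essays"
?>
```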


if($url = parse_url($start))
{
    if(isset($url['scheme']))
    {
        $b_scheme = $url['scheme'];
        $b_url = $b_scheme."://";
    }
    if(isset($url['host']))
    {
        $b_host = $url['host'];
        $b_url = $b_url.$b_host;
    }
    if(isset($url['path']))
    {
        $b_path = dirname($url['path']);
        $b_url = $b_url.$b_path;
    }
}
else
{
    echo("\nError!\n");
    echo("Description: Unable to parse starting URL. ");
    echo("Please enter a different URL to start from.\n");
    echo("Starting URL: " .$start. "\n\n");
    exit;
}

Wow, that's a lot of stuff. Let's look at it line by line. The parse_url function in the argument of the if statement takes the URL held in $start and breaks it down into all its separate pieces. These pieces (scheme, host, and path) are then stored in their own variables for use later. The else statement returns an error and terminates our spider if the URL we provided was unparsable (because we typed it in wrong or whatever).
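If you're curious what those pieces actually look like, here's a little sketch of what parse_url hands back (the URL is just an example):

```php
<?php
    // An example URL; any well-formed URL will do.
    $url = parse_url("http://searchlores.org/essays/spider.htm");

    echo $url['scheme']."\n"; // http
    echo $url['host']."\n";   // searchlores.org
    echo $url['path']."\n";   // /essays/spider.htm
?>
```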


$links = array($start => "0");
$gold = array();

The variables $links and $gold are both arrays. While strings, like $start, hold only one thing, arrays can hold lots of things. $links holds the links that our spider is going to follow, and so we initialize it with the URL held in $start. We give it an initial value of "0" (it's a key => value array) to indicate that that link hasn't been followed yet. Later, a value of "1" will indicate that a link has been followed; that way we don't end up following the same link twice.

$gold is what's going to hold the results of our search, and it's initialized as an empty numeric array. While the keys in $links take the form of URLs, the keys in $gold take the form of numbers. Zero is the first element in the array, one is the second, two is the third, etc. Elements of arrays are accessed with brackets (as we shall see later), and it is possible to have arrays of arrays, but we'll get to that when we get to that.


while($p_link = array_search("0", $links))
{
    ...
}

The function array_search is going to do just what it says. In this case, it is going to search the array $links for the first value that is "0" and return its key, which we'll store in $p_link. That means that $p_link is going to contain the URL of any links our spider hasn't followed yet. Making this the argument of a while loop means that the statements inside the while loop will continue to execute until all the links have been looked at, i.e., our spider will keep going until it's run out of links.
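Here's that whole piece of bookkeeping in miniature, using our starting URL as the only link:

```php
<?php
    // One unfollowed link: a value of "0" means "not seen yet".
    $links = array("http://searchlores.org/" => "0");

    // array_search returns the KEY of the first element whose value is "0".
    $p_link = array_search("0", $links);   // http://searchlores.org/
    $links[$p_link] = "1";                 // mark it as followed

    // Nothing is "0" anymore, so array_search returns false
    // and a while loop built on it would stop right here.
    var_dump(array_search("0", $links));   // bool(false)
?>
```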


$links[$p_link] = "1";

Here we're simply marking a link as having been seen. See how we can access the element of the $links array with brackets? If $links were a multi-dimensional array (an array of arrays) we'd access its elements with multiple sets of brackets like this: $array[$value1][$value2]


if(@ $contents = file_get_contents($p_link))
{
    ...
}

The function file_get_contents will take the URL stored in $p_link, retrieve it, and store its contents in the variable $contents. We're sticking it inside an if statement because if the file retrieval fails we want our spider to just go on to the next link. The at symbol tells PHP that it should suppress any warnings generated by executing this line of code. We don't need a bunch of garbage filling our screen for every malformed URL we come across. After all, there are a lot of them on the web.
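On its own, the fetch-or-skip pattern looks like this (the path here is made up precisely so the call will fail):

```php
<?php
    // A path that's assumed NOT to exist, so the fetch fails quietly.
    if(@ $contents = file_get_contents("/no/such/file.html"))
    {
        echo "Fetched " .strlen($contents). " bytes\n";
    }
    else
    {
        // Without the @, PHP would print a warning before we got here.
        echo "Skipping unreadable link\n";
    }
?>
```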


echo("Following link: " .$p_link. "\n");

It's good to know what links we're following, so here we output them to the screen. When outputting the contents of variables (like $p_link), we join them to the surrounding strings with periods, PHP's string concatenation operator.


if($url = parse_url($p_link))
{
    $p_url = $p_link;
    if(isset($url['scheme']))
    {
        $p_scheme = $url['scheme'];
        $p_url = $p_scheme."://";
    }
    if(isset($url['host']))
    {
        $p_host = $url['host'];
        $p_url = $p_url.$p_host;
    }
    if(isset($url['path']))
    {
        $p_path = dirname($url['path']);
        $p_url = $p_url.$p_path;
    }
}

Again, we're gathering information about the URL we're looking at for use later. The dirname function in the last if statement simply drops the file's name from the path. So /directory1/directory2/file1.html becomes /directory1/directory2.
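A quick sketch of dirname at work, on some made-up paths:

```php
<?php
    echo dirname("/directory1/directory2/file1.html")."\n"; // /directory1/directory2
    echo dirname("/file1.html")."\n";                       // /
    echo dirname("file1.html")."\n";                        // .
?>
```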


preg_match_all($search, $contents, $search_results);

More Perl regular expressions. preg_match_all searches $contents (the contents of the URL we just followed) for any matches to $search (the regex we're looking for). The results are stored in $search_results, which just happens to be a multi-dimensional array.


for($i = 0; $i < count($search_results[1]); $i++)
{
    $result = $search_results[1][$i];
    if(array_search($result, $gold) === false)
    {
        $gold[] = $result;
    }
}

This for loop dumps our search results into our $gold array if they aren't there already. Notice how all the operations on $search_results are done from an array index of 1. That's because the array index of 0 contains matches for the entire regex we searched for, including the emphasis tags. Array index 1 just contains the stuff between the tags (which is the stuff we care about), so it's all we're going to look at.
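In miniature, the difference between index 0 and index 1 looks like this (the HTML is invented):

```php
<?php
    $contents = "Some <em>gold</em> in here.";
    preg_match_all("/<em>(.*?)<\/em>/", $contents, $search_results);

    print_r($search_results[0]); // <em>gold</em>  -- the whole match, tags and all
    print_r($search_results[1]); // gold           -- just the stuff between the tags
?>
```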


preg_match_all("/href=\"(.*?)\"/", $contents, $link_results);

More extracting of stuff from $contents. This time we're getting all the links. We use a regular expression to match anything between the quotes of href="" which is the bit in an anchor tag that forms a URL. $link_results holds our results of that search.


for($i = 0; $i < count($link_results[1]); $i++)
{
    $c_link = $link_results[1][$i];
    $c_valid = true;
    $c_link = trim($c_link);
    ...	
}

This for loop iterates through all the links in $link_results and picks them out one at a time. A boolean variable, $c_valid, is initially set as true. For now, we'll assume that the link we got from the page is one worth following. That may change in the future. The trim function is simply a safety precaution that removes any whitespace from the beginning and end of our link.


if(@ $url = parse_url($c_link))
{
    if(isset($url['host']))
    {
        $c_host = $url['host'];
    }
    if(isset($url['query']))
    {
        $c_query = $url['query'];
    }
    if(isset($url['fragment']))
    {
        $c_fragment = $url['fragment'];
    }
}
else
{
    $c_valid = false;
}

Sometimes it seems like information gathering and error suppression is all our spider does. Look at $c_query and $c_fragment. Can you figure out what they might be holding? Don't worry, we'll use them here in a few minutes. If we can't parse the link, it's bad, so we'll set $c_valid as false.


if(preg_match("/^(http:|https:|ftp:|file:)/i", $c_link) &&
   strpos($c_host, $b_host) === false)
{
    $c_valid = false;
}

Marking external links as bad keeps our spider from wandering outside of the current site. An external link is defined as one that starts with either an http, https, ftp, or file scheme, and whose host name ($c_host) doesn't match the host name of the site we started on ($b_host).
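The strpos half of that test deserves a closer look. strpos returns the position of one string inside another, or false if it isn't there at all, so the check passes for any host that contains our starting host (the host names below are just examples):

```php
<?php
    // "searchlores.org" appears inside "www.searchlores.org" at position 4,
    // so a link with this host counts as internal.
    var_dump(strpos("www.searchlores.org", "searchlores.org")); // int(4)

    // No match at all: strpos returns false, so the link is external.
    var_dump(strpos("example.com", "searchlores.org"));         // bool(false)
?>
```

That's also why the comparison uses === rather than ==: a match at position 0 is "falsy", and a plain == test would wrongly call such a link external.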


elseif(preg_match("/^(mailto:|javascript:|news:)/i", $c_link))
{
    $c_valid = false;
}

More bad links. mailto, javascript, and news are also schemes we want our spider to avoid. The i at the end of the regex marks it as being case insensitive. The caret at the beginning means "the beginning of the string should look like". The pipe symbols stand for or, and the parentheses are just a way to group stuff.


elseif(preg_match("/\.(jpg|gif|png|ico)$/i", $c_link))
{
    $c_valid = false;
}
elseif(preg_match("/\.(zip|rar|tar|gz)$/i", $c_link))
{
    $c_valid = false;
}
elseif(preg_match("/\.(c|pl|py|js|reg|orig)$/i", $c_link))
{
    $c_valid = false;
}
elseif(preg_match("/\.(exe|java|class)$/i", $c_link))
{
    $c_valid = false;
}
elseif(preg_match("/\.(css|xml|txt|doc|pdf|lit)$/i", $c_link))
{
    $c_valid = false;
}
elseif(preg_match("/\.(mp3|wav|ra|pm)$/i", $c_link))
{
    $c_valid = false;
}

Don't follow sound files, archived files, executable files, pictures, etc. Notice how easy it would be to modify our spider so it kept a list of such things instead. Image searching, anyone?


if($c_valid)
{
    if(isset($c_query))
    {
        $c_link = preg_replace("/\?(.*?)$/", "", $c_link);
    }
    if(isset($c_fragment))
    {
        $c_link = preg_replace("/#(.*?)$/", "", $c_link);
    }
    ...
}

Remember the fragment and query parts we stored above? Here we get to throw them away. Fragments are the part of the URL that comes after a hash symbol. Queries are the part that comes after a question mark. Neither is good for our current purposes, though you'll have to evaluate your own needs with regards to them. Some sites use them extensively, thus making their pages that much harder for search engines (or spiders) to index.
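On a made-up link, the two clean-up replacements look like this:

```php
<?php
    $c_link = "page.php?id=42#section3";

    // Drop everything from the question mark onward...
    $c_link = preg_replace("/\?(.*?)$/", "", $c_link);

    // ...then everything from the hash onward.
    $c_link = preg_replace("/#(.*?)$/", "", $c_link);

    echo $c_link."\n"; // page.php
?>
```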


if(preg_match("/^\//", $c_link))
{
    $c_link = $b_scheme."://".$b_host.$c_link;
}

Here's the part of the code where we start transforming relative URLs into absolute ones. This is arguably one of the trickier parts of spider writing. The above code looks for URLs that start with a /, and then tacks our starting scheme and host name onto them.


if(preg_match("/^\.\.\//", $c_link))
{
    preg_match_all("/\.\.\//", $c_link, $count);
    $count = count($count[0]);

    $c_link = preg_replace("/\.\.\//", "", $c_link);

    $p_path = preg_replace("/^\//", "", $p_path);
    $p_path = preg_replace("/\/$/", "", $p_path);

    $path_array = explode("/", $p_path);
    $new_path = "";
    for($j = $count; $j > 0; $j--)
    {
    array_pop($path_array);
    }
    for($j = 0; $j < count($path_array); $j++)
    {
        $new_path = $new_path.$path_array[$j]."/";
    }

    $c_link = $p_scheme."://".$p_host."/".$new_path.$c_link;
}

Wow, that's a nightmare. Here's the line by line. The preg_match function in the if statement finds the URLs that start with ../. The next two lines use preg_match_all and count to count how many directories we're going to have to back up from where we currently are, i.e. how many ../'s there are in this URL. The next line, preg_replace, removes those ../'s. Leading and trailing slashes are removed from our current path ($p_path) too. Now we get to the tricky part. Our path gets exploded into an array ($path_array) by separating it wherever there's a slash. Then, the required number of directories are removed from the end of the array with the array_pop function, and the path is reassembled as $new_path. Last but not least, our link is put together with the scheme and host from the page we extracted it from.
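To convince yourself it works, here's the same logic run on some invented values: a page living in /essays/advanced/ that links to ../../basics/intro.htm:

```php
<?php
    // Made-up values standing in for the page we're currently on.
    $p_scheme = "http";
    $p_host = "searchlores.org";
    $p_path = "/essays/advanced/";
    $c_link = "../../basics/intro.htm";

    // Count the ../ sequences: two directories to back out of.
    preg_match_all("/\.\.\//", $c_link, $count);
    $count = count($count[0]);

    // Strip the ../ bits, then the path's leading and trailing slashes.
    $c_link = preg_replace("/\.\.\//", "", $c_link);
    $p_path = preg_replace("/^\//", "", $p_path);
    $p_path = preg_replace("/\/$/", "", $p_path);

    // Explode the path and pop off the backtracked directories.
    $path_array = explode("/", $p_path);
    for($j = $count; $j > 0; $j--)
    {
        array_pop($path_array);
    }
    $new_path = "";
    for($j = 0; $j < count($path_array); $j++)
    {
        $new_path = $new_path.$path_array[$j]."/";
    }

    echo $p_scheme."://".$p_host."/".$new_path.$c_link."\n";
    // http://searchlores.org/basics/intro.htm
?>
```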


$c_link = preg_replace("/^\.\//", "", $c_link);

This case is a lot easier than the last one. URLs that start with ./ are simply rewritten without it.


if(!preg_match("/^http:/", $c_link))
{
    if(preg_match("/\/$/", $p_url))
    {
        $c_link = $p_url.$c_link;
    }
    else
    {
        $c_link = $p_url."/".$c_link;
    }
}

Finally, all those URLs that are simply file.html are handled. We tack an absolute URL of the current working directory onto them and add a slash if they need it.


$c_link = preg_replace("/^http:\/\/www\./", "http://", $c_link);

You may or may not want to include this line of code in your spider. It assumes that http://searchlores.org/ and http://www.searchlores.org/ resolve to the same place, and the links get rewritten appropriately so we don't look at the same pages twice. For some sites, that might be true; for others, it might not be. Play around and see.


if(!array_key_exists($c_link, $links))
{
    $links[$c_link] = "0";
}

Don't worry, the end is in sight. If we don't already have this link, we need to add it to our list and mark it as being one we haven't looked at. Because of the way this spider's written, it performs a breadth-first search. All the links from the first page are extracted and followed, then all the ones from the second page, then the third, and so on and so forth until it runs out of links.


echo("\nTotal number of links followed was ".count($links).".\n\n");
echo("\nSearch results: \n\n");
for($i = 0; $i < count($gold); $i++)
{
    echo($gold[$i]. "\n");
}
echo("\nTotal number of search results found was ".count($gold).".\n\n");

Give us some results, spider! After all, that's what we built you for. Here's the total number of links we looked at, the results of our search, and the total number of results found. Of course, you could always save this stuff to a file if you so chose.

As always, in fieri

That's it; that's our whole spider. Of course there are many more paths you can take from here. An essay on common secret questions for password guessing seems like unexplored territory. In addition, the spider you now have can easily be modified to harvest email addresses instead of emphasis tags. Learn the techniques of the spammers and then perhaps we can develop methods to fight against them (email addresses that are stored on a non-public part of the server and encoded / decoded on the fly, maybe?).

If you do make some changes to this spider, please let me know. It's a little CPU intensive due to all the Perl regex functions, so speed improvements would be welcome. Bugs in the code, spelling errors in my prose; feedback is always appreciated.

- Frank

Spider code


#!/usr/bin/php
<?php
   echo("\n");


   // Where should we start searching from?
   $start = "http://searchlores.org/";


   // What should we be looking for?
   $search = "/<em>(.*?)<\/em>/";


   // Build information about the site we're going to search.
   if($url = parse_url($start))
   {
      if(isset($url['scheme']))
      {
         $b_scheme = $url['scheme'];
         $b_url = $b_scheme."://";
      }
      if(isset($url['host']))
      {
         $b_host = $url['host'];
         $b_url = $b_url.$b_host;
      }
      if(isset($url['path']))
      {
         $b_path = dirname($url['path']);
         $b_url = $b_url.$b_path;
      }
   }
   else
   {
      echo("\nError!\n");
      echo("Description: Unable to parse starting URL. ");
      echo("Please enter a different URL to start from.\n");
      echo("Starting URL: " .$start. "\n\n");
      exit;
   }


   // Initialize our array of links.
   $links = array($start => "0");


   // Initialize our array of search results.
   $gold = array();


   // Keep crawling until we run out of links.
   while($p_link = array_search("0", $links))
   {
      // Mark this link as having been seen.
      $links[$p_link] = "1";


      // Get the contents of the link we're currently looking at.
      // If we fail this, there's no point in going further.
      // We're going to suppress PHP's warning messages here as well.
      if(@ $contents = file_get_contents($p_link))
      {

         // What link are we following?
         echo("Following link: " .$p_link. "\n");


         // Build information about the link we're currently looking at.
         if($url = parse_url($p_link))
         {
            $p_url = $p_link;
            if(isset($url['scheme']))
            {
               $p_scheme = $url['scheme'];
               $p_url = $p_scheme."://";
            }
            if(isset($url['host']))
            {
               $p_host = $url['host'];
               $p_url = $p_url.$p_host;
            }
            if(isset($url['path']))
            {
               $p_path = dirname($url['path']);
               $p_url = $p_url.$p_path;
            }
         }


         // Extract all the search matches from the current page.
         preg_match_all($search, $contents, $search_results);


         // Put the search results into our pot of gold.
         for($i = 0; $i < count($search_results[1]); $i++)
         {
            $result = $search_results[1][$i];
            if(array_search($result, $gold) === false)
            {
               $gold[] = $result;
            }
         }


         // Extract the links from the current page.
         preg_match_all("/href=\"(.*?)\"/", $contents, $link_results);


         // Loop through our extracted links and manipulate them.
         for($i = 0; $i < count($link_results[1]); $i++)
         {

            // Get an extracted link from our list and assume it's good.
            $c_link = $link_results[1][$i];
            $c_valid = true;


            // Trim any whitespace that might be on our link.
            $c_link = trim($c_link);


            // Build information about our extracted link.
            // If we can't parse the URL, don't continue.
            // Suppress all PHP warnings here as well.
            if(@ $url = parse_url($c_link))
            {
               if(isset($url['host']))
               {
                  $c_host = $url['host'];
               }
               if(isset($url['query']))
               {
                  $c_query = $url['query'];
               }
               if(isset($url['fragment']))
               {
                  $c_fragment = $url['fragment'];
               }
            }
            else
            {
               // If we won't be able to follow it, mark it as bad.
               $c_valid = false;
            }


            // Decide whether this link is internal or external.
            // If it's external, we don't want to follow it.
            if(preg_match("/^(http:|https:|ftp:|file:)/i", $c_link) &&
               strpos($c_host, $b_host) === false)
            {
               $c_valid = false;
            }


            // Don't follow javascript or mailto links.
            elseif(preg_match("/^(mailto:|javascript:|news:)/i", $c_link))
            {
               $c_valid = false;
            }


            // Don't follow pictures, zip files, etc.
            elseif(preg_match("/\.(jpg|gif|png|ico)$/i", $c_link))
            {
               $c_valid = false;
            }
            elseif(preg_match("/\.(zip|rar|tar|gz)$/i", $c_link))
            {
               $c_valid = false;
            }
            elseif(preg_match("/\.(c|pl|py|js|reg|orig)$/i", $c_link))
            {
               $c_valid = false;
            }
            elseif(preg_match("/\.(exe|java|class)$/i", $c_link))
            {
               $c_valid = false;
            }
            elseif(preg_match("/\.(css|xml|txt|doc|pdf|lit)$/i", $c_link))
            {
               $c_valid = false;
            }
            elseif(preg_match("/\.(mp3|wav|ra|pm)$/i", $c_link))
            {
               $c_valid = false;
            }


            // If our link's made it this far, it's good, so let's keep it.
            if($c_valid)
            {

               // Remove queries from the end of a link.
               if(isset($c_query))
               {
                  $c_link = preg_replace("/\?(.*?)$/", "", $c_link);
               }


               // Remove fragments from the end of a link.
               if(isset($c_fragment))
               {
                  $c_link = preg_replace("/#(.*?)$/", "", $c_link);
               }


               // Case 1: The URL is of the form: /directory/file
               if(preg_match("/^\//", $c_link))
               {
                  $c_link = $b_scheme."://".$b_host.$c_link;
               }


               // Case 2: The URL is of the form: ../directory/file
               if(preg_match("/^\.\.\//", $c_link))
               {

                  // How many directories will we have to backtrack into?
                  preg_match_all("/\.\.\//", $c_link, $count);
                  $count = count($count[0]);

                  // Remove the relative bits from our link.
                  $c_link = preg_replace("/\.\.\//", "", $c_link);

                  // Remove leading and trailing slashes from our path.
                  $p_path = preg_replace("/^\//", "", $p_path);
                  $p_path = preg_replace("/\/$/", "", $p_path);

                  // Backtrack the required number of directories.
                  $path_array = explode("/", $p_path);
                  $new_path = "";
                  for($j = $count; $j > 0; $j--)
                  {
                     array_pop($path_array);
                  }
                  for($j = 0; $j < count($path_array); $j++)
                  {
                     $new_path = $new_path.$path_array[$j]."/";
                  }

                  // Tack our new path onto the beginning of our link.
                  $c_link = $p_scheme."://".$p_host."/".$new_path.$c_link;
               }


               // Case 3: The URL is of the form: ./directory/file
               $c_link = preg_replace("/^\.\//", "", $c_link);


               // Case 4: The URL is of the form: file
               if(!preg_match("/^http:/", $c_link))
               {
                  if(preg_match("/\/$/", $p_url))
                  {
                     $c_link = $p_url.$c_link;
                  }
                  else
                  {
                     $c_link = $p_url."/".$c_link;
                  }
               }


               // Remove any www. stuff from the start of our link.
               $c_link = preg_replace("/^http:\/\/www\./", "http://", $c_link);

               // Add our extracted link to our list of links to look at.
               if(!array_key_exists($c_link, $links))
               {
                  $links[$c_link] = "0";
               }
            }
         }
      }
   }


   // How many links did we end up following?
   echo("\nTotal number of links followed was ".count($links).".\n\n");


   // What kind of search results did we get?
   echo("\nSearch results: \n\n");
   for($i = 0; $i < count($gold); $i++)
   {
      echo($gold[$i]. "\n");
   }
   echo("\nTotal number of search results found was ".count($gold).".\n\n");

?>







(c) III Millennium: [fravia+] , all rights reserved and reversed