infoanalyzer: A Comprehensive Web Reconnaissance Toolkit

The infoanalyzer project is an open-source web reconnaissance toolkit built for security professionals, penetration testers, and system administrators. The current release includes two powerful, complementary tools: PHPInfo Grabber and URL Categorizer.

June 04, 2025
Victor Nthuli
Security Best Practices
5 min read


Introduction

Today, I’m excited to announce the launch of infoanalyzer, an open-source project designed to revolutionize how security professionals gather and analyze information from web applications. Whether you’re a penetration tester, security researcher, or system administrator, infoanalyzer provides a suite of powerful tools to streamline reconnaissance and uncover valuable insights that might otherwise remain hidden.

The initial release of infoanalyzer includes two complementary tools:

  1. PHPInfo Grabber: Extracts and analyzes system information from phpinfo() pages
  2. URL Categorizer: Intelligently organizes and classifies discovered URLs

Together, these tools form the foundation of a comprehensive web reconnaissance methodology that transforms raw data into actionable intelligence.

PHPInfo Grabber: Unveiling Server Insights

The Power of phpinfo()

For those unfamiliar, phpinfo() is a PHP function that displays detailed information about the PHP environment, server configuration, and runtime settings. While this information is invaluable for debugging and configuration purposes, it can also expose sensitive details that may be leveraged during security assessments.

PHPInfo Grabber transforms the dense, table-heavy output of phpinfo() into structured, easily analyzable data formats, enabling you to quickly identify security misconfigurations, sensitive paths, and potential vulnerabilities.
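
To make that concrete, here is a rough sketch of the shape of the data the tool aims to produce. The category names mirror those listed below; every value shown is hypothetical:

# Illustrative only: hypothetical values showing the structure built by the tool
# (a nested dict of section -> key -> value, later exported as JSON, CSV, and TXT).
interesting_data = {
    "System": {"System": "Linux web01 5.15.0-generic x86_64"},
    "PHP": {"PHP Version": "8.1.2", "allow_url_include": "Off"},
    "Paths": {"Loaded Configuration File": "/etc/php/8.1/apache2/php.ini"},
    "Interesting Files": ["/var/www/html/config.php"],
}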

Key Features

1. Robust Data Extraction

PHPInfo Grabber employs multiple parsing strategies to handle various phpinfo() page structures:

  • Table-based parsing for standard phpinfo() layouts
  • Alternative parsing for non-standard structures
  • Regex-based extraction as a fallback mechanism

This ensures the tool works effectively across different PHP versions and configurations.

2. Intelligent Information Categorization

The tool automatically categorizes extracted information into meaningful sections:

  • System: OS details, architecture, hostname
  • PHP: Version, configuration, extensions, limits
  • Server: Web server software, document root, request data
  • Paths: File system locations, configuration paths
  • Environment: Environment variables
  • Database: Database connections and configurations
  • Interesting Files: Automatically detected sensitive files and paths

3. Comprehensive Export Options

All findings can be exported in multiple formats:

  • JSON: Full data export for programmatic analysis
  • CSV: Categorized data for spreadsheet analysis
  • TXT: Human-readable summary reports

4. Actionable Intelligence

Beyond simply extracting data, PHPInfo Grabber provides:

  • Highlighted sensitive information
  • Suggested next steps for further exploration
  • Generated commands based on discovered information
  • Potential security issues based on configuration values

Usage Example

Using PHPInfo Grabber is straightforward:

python phpinfo_grabber.py https://example.com/phpinfo.php

The full source of phpinfo_grabber.py is shown below:

#!/usr/bin/env python3

import requests
import re
import argparse
import os
import csv
import json
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse
from datetime import datetime

# ANSI colors for better readability
class Colors:
    GREEN = '\033[92m'
    YELLOW = '\033[93m'
    RED = '\033[91m'
    BLUE = '\033[94m'
    MAGENTA = '\033[95m'
    CYAN = '\033[96m'
    ENDC = '\033[0m'
    BOLD = '\033[1m'

def print_banner():
    banner = f"""{Colors.BLUE}{Colors.BOLD}
╔═══════════════════════════════════════════════════╗
║              PHPINFO GRABBER TOOL                 ║
║     Extract and analyze system information        ║
╚═══════════════════════════════════════════════════╝
{Colors.ENDC}"""
    print(banner)

def get_phpinfo(url, timeout=10, verify_ssl=False, user_agent=None, proxy=None):
    """Fetch phpinfo page content"""
    # Setup headers
    headers = {
        'User-Agent': user_agent or 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Accept-Language': 'en-US,en;q=0.5',
        'Connection': 'keep-alive',
    }

    # Setup proxy if provided
    proxies = None
    if proxy:
        proxies = {
            'http': proxy,
            'https': proxy
        }

    try:
        response = requests.get(
            url, 
            headers=headers,
            timeout=timeout,
            verify=verify_ssl,
            proxies=proxies
        )

        if response.status_code == 200:
            print(f"{Colors.GREEN}[+] Successfully fetched phpinfo page from {url}{Colors.ENDC}")
            return response.text
        else:
            print(f"{Colors.RED}[!] Failed to fetch phpinfo page. Status code: {response.status_code}{Colors.ENDC}")
            return None
    except requests.exceptions.Timeout:
        print(f"{Colors.RED}[!] Request timed out for {url}{Colors.ENDC}")
        return None
    except requests.exceptions.SSLError:
        print(f"{Colors.YELLOW}[!] SSL verification failed. Try with --no-verify-ssl option.{Colors.ENDC}")
        return None
    except requests.exceptions.ConnectionError:
        print(f"{Colors.RED}[!] Connection error for {url}. Check if the URL is correct.{Colors.ENDC}")
        return None
    except Exception as e:
        print(f"{Colors.RED}[!] Error fetching {url}: {str(e)}{Colors.ENDC}")
        return None

def parse_phpinfo(html_content):
    """Parse phpinfo HTML and extract key-value pairs"""
    soup = BeautifulSoup(html_content, 'html.parser')

    # Different phpinfo styles and structures
    # Try to detect the phpinfo style
    phpinfo_data = {}

    # Try to find all tables
    tables = soup.find_all('table')

    if tables:
        print(f"{Colors.GREEN}[+] Found {len(tables)} tables in phpinfo page{Colors.ENDC}")

        for table_index, table in enumerate(tables):
            # Check if this table has section headers (tr with th that spans multiple columns)
            section_header = None
            section_headers = table.find_all('tr', class_='h')
            if section_headers:
                for header in section_headers:
                    th = header.find('th')
                    if th:
                        section_header = th.text.strip()
                        break

            # If no section header found in the table classes, try to find it another way
            if not section_header:
                # Look for a preceding h2 element
                prev_h2 = table.find_previous('h2')
                if prev_h2:
                    section_header = prev_h2.text.strip()

            # Default section name if none found
            if not section_header:
                section_header = f"Section_{table_index}"

            # Add the section to our data structure
            if section_header not in phpinfo_data:
                phpinfo_data[section_header] = {}

            # Process rows
            rows = table.find_all('tr')
            for row in rows:
                # Skip header rows
                if row.get('class') and 'h' in row.get('class'):
                    continue

                # Extract key and value
                cells = row.find_all(['td', 'th'])
                if len(cells) >= 2:  # Key-value pair
                    key = cells[0].text.strip()
                    value = cells[1].text.strip()

                    # Skip empty keys
                    if key:
                        phpinfo_data[section_header][key] = value
    else:
        # Alternative approach for non-table structures
        print(f"{Colors.YELLOW}[!] No tables found. Trying alternative parsing method...{Colors.ENDC}")

        # Try to find divs with class 'center'
        center_divs = soup.find_all('div', class_='center')
        if center_divs:
            for div in center_divs:
                # Walk the div's children, tracking the current section header (h2 elements)
                current_section = "General"

                for element in div.children:
                    if element.name == 'h2':
                        current_section = element.text.strip()
                        if current_section not in phpinfo_data:
                            phpinfo_data[current_section] = {}
                    elif element.name == 'table':
                        # Process this table under the current section
                        rows = element.find_all('tr')
                        for row in rows:
                            cells = row.find_all(['td', 'th'])
                            if len(cells) >= 2:
                                key = cells[0].text.strip()
                                value = cells[1].text.strip()
                                if key:
                                    phpinfo_data[current_section][key] = value
        else:
            # Last resort: try to extract using regex
            print(f"{Colors.YELLOW}[!] No standard phpinfo structure found. Using regex patterns...{Colors.ENDC}")

            # Find variable-value pairs using regex
            pattern = r'<tr><td class="e">(.*?)</td><td class="v">(.*?)</td></tr>'
            matches = re.findall(pattern, html_content, re.DOTALL)

            if matches:
                phpinfo_data["General"] = {}
                for key, value in matches:
                    # Clean up HTML entities and tags
                    key = re.sub(r'<.*?>', '', key).strip()
                    value = re.sub(r'<.*?>', '', value).strip()
                    if key:
                        phpinfo_data["General"][key] = value
            else:
                print(f"{Colors.RED}[!] Failed to extract data using all methods.{Colors.ENDC}")

    # If we still have no data, try a very basic approach
    if not phpinfo_data:
        print(f"{Colors.YELLOW}[!] Trying basic key-value extraction...{Colors.ENDC}")

        # Very basic pattern to try to extract key-value pairs
        basic_pattern = r'<tr[^>]*>\s*<td[^>]*>(.*?)</td>\s*<td[^>]*>(.*?)</td>'
        basic_matches = re.findall(basic_pattern, html_content, re.DOTALL)

        if basic_matches:
            phpinfo_data["Basic_Extraction"] = {}
            for key, value in basic_matches:
                # Clean up HTML entities and tags
                key = re.sub(r'<.*?>', '', key).strip()
                value = re.sub(r'<.*?>', '', value).strip()
                if key:
                    phpinfo_data["Basic_Extraction"][key] = value

    return phpinfo_data

def extract_interesting_data(phpinfo_data):
    """Extract interesting information from phpinfo data"""
    interesting_data = {
        "System": {},
        "PHP": {},
        "Server": {},
        "Paths": {},
        "Environment": {},
        "Database": {},
        "Interesting Files": [],
    }

    # Regular expressions for interesting file paths
    file_patterns = [
        r'(?i)(?:^|[\/\\])(?:etc|usr|var|opt|tmp|home|root|www|web|public_html|app|config|database)(?:[\/\\][^\/\\]+)+',
        r'(?i)(?:\.php|\.ini|\.conf|\.xml|\.json|\.yml|\.yaml|\.log|\.txt|\.sql|\.db|\.sqlite|\.htaccess)$',
        r'(?i)(?:password|passwd|key|secret|token|credential|auth|api_key)(?:\.txt|\.ini|\.conf|\.json|\.xml|\.yml|\.yaml)$'
    ]

    # Interesting keys to look for
    interesting_keys = {
        "System": [
            "System", "PHP Version", "Server API", "Server Name", "Server Addr", "Server Port",
            "User/Group", "Server Software", "Server OS", "PHP OS", "OS", "Architecture",
            "Hostname", "DOCUMENT_ROOT", "SERVER_NAME", "SERVER_ADDR", "SERVER_PORT", "REMOTE_ADDR"
        ],
        "PHP": [
            "Configure Command", "Loaded Configuration File", "Additional .ini files parsed",
            "extension_dir", "disable_functions", "allow_url_fopen", "allow_url_include",
            "upload_max_filesize", "post_max_size", "memory_limit", "max_execution_time",
            "include_path", "open_basedir", "display_errors", "error_reporting", "log_errors",
            "error_log", "opcache", "xdebug"
        ],
        "Server": [
            "DOCUMENT_ROOT", "SERVER_SOFTWARE", "SERVER_NAME", "SERVER_ADDR", "SERVER_PORT",
            "REMOTE_ADDR", "REMOTE_PORT", "HTTP_HOST", "HTTP_USER_AGENT", "HTTP_ACCEPT",
            "HTTP_ACCEPT_LANGUAGE", "HTTP_ACCEPT_ENCODING", "HTTP_CONNECTION", "HTTP_REFERER",
            "REQUEST_TIME", "REQUEST_TIME_FLOAT", "QUERY_STRING", "REQUEST_URI", "SCRIPT_NAME",
            "SCRIPT_FILENAME", "PATH_INFO", "PATH_TRANSLATED", "PHP_SELF", "HTTPS"
        ],
        "Paths": [
            "PATH", "DOCUMENT_ROOT", "SCRIPT_FILENAME", "Loaded Configuration File",
            "Additional .ini files parsed", "extension_dir", "include_path", "open_basedir",
            "error_log", "upload_tmp_dir", "session.save_path", "sys_temp_dir", "doc_root"
        ],
        "Environment": [
            "PATH", "HOME", "USER", "HOSTNAME", "PWD", "SHELL", "LANG", "REMOTE_ADDR",
            "HTTP_USER_AGENT", "SERVER_SOFTWARE", "SERVER_NAME", "SERVER_ADDR"
        ],
        "Database": [
            "PDO", "mysqli", "mysql", "pgsql", "sqlite", "oci", "dbx", "odbc", "mssql",
            "db2", "mongodb", "redis", "memcached", "memcache"
        ]
    }

    # Extract interesting data
    for section_name, section_data in phpinfo_data.items():
        for key, value in section_data.items():
            # Look for file paths in values
            for pattern in file_patterns:
                file_matches = re.findall(pattern, value)
                for file_match in file_matches:
                    if file_match not in interesting_data["Interesting Files"]:
                        interesting_data["Interesting Files"].append(file_match)

            # Categorize based on interesting keys
            for category, keys in interesting_keys.items():
                for interesting_key in keys:
                    if interesting_key.lower() in key.lower() or key.lower() in interesting_key.lower():
                        interesting_data[category][key] = value
                        break

    # If certain categories are empty, delete them
    for category in list(interesting_data.keys()):
        if isinstance(interesting_data[category], dict) and not interesting_data[category]:
            del interesting_data[category]
        elif isinstance(interesting_data[category], list) and not interesting_data[category]:
            del interesting_data[category]

    return interesting_data

def export_data(phpinfo_data, interesting_data, output_dir):
    """Export the extracted data to various formats"""
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)
        print(f"{Colors.GREEN}[+] Created output directory: {output_dir}{Colors.ENDC}")

    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")

    # Export full phpinfo data to JSON
    json_file = os.path.join(output_dir, f"phpinfo_full_{timestamp}.json")
    with open(json_file, 'w', encoding='utf-8') as f:
        json.dump(phpinfo_data, f, indent=2)
    print(f"{Colors.GREEN}[+] Full phpinfo data exported to: {json_file}{Colors.ENDC}")

    # Export interesting data to JSON
    interesting_json = os.path.join(output_dir, f"phpinfo_interesting_{timestamp}.json")
    with open(interesting_json, 'w', encoding='utf-8') as f:
        json.dump(interesting_data, f, indent=2)
    print(f"{Colors.GREEN}[+] Interesting data exported to: {interesting_json}{Colors.ENDC}")

    # Export interesting data to CSV
    for category, data in interesting_data.items():
        if isinstance(data, dict) and data:
            csv_file = os.path.join(output_dir, f"phpinfo_{category.lower()}_{timestamp}.csv")
            with open(csv_file, 'w', newline='', encoding='utf-8') as f:
                writer = csv.writer(f)
                writer.writerow(["Key", "Value"])
                for key, value in data.items():
                    writer.writerow([key, value])
            print(f"{Colors.GREEN}[+] {category} data exported to: {csv_file}{Colors.ENDC}")
        elif isinstance(data, list) and data:
            csv_file = os.path.join(output_dir, f"phpinfo_{category.lower()}_{timestamp}.csv")
            with open(csv_file, 'w', newline='', encoding='utf-8') as f:
                writer = csv.writer(f)
                writer.writerow(["Value"])
                for item in data:
                    writer.writerow([item])
            print(f"{Colors.GREEN}[+] {category} data exported to: {csv_file}{Colors.ENDC}")

    # Create a summary text file
    summary_file = os.path.join(output_dir, f"phpinfo_summary_{timestamp}.txt")
    with open(summary_file, 'w', encoding='utf-8') as f:
        f.write("PHPINFO SUMMARY\n")
        f.write("==============\n\n")

        for category, data in interesting_data.items():
            f.write(f"{category}:\n")
            f.write(f"{'-' * len(category)}:\n")

            if isinstance(data, dict):
                for key, value in data.items():
                    f.write(f"{key}: {value}\n")
            elif isinstance(data, list):
                for item in data:
                    f.write(f"- {item}\n")

            f.write("\n")

    print(f"{Colors.GREEN}[+] Summary report exported to: {summary_file}{Colors.ENDC}")

    return json_file, interesting_json, summary_file

def generate_commands(interesting_data):
    """Generate commands for further exploration based on interesting data"""
    commands = []

    # Paths to check
    if "Paths" in interesting_data:
        for key, path in interesting_data["Paths"].items():
            if path and os.path.sep in path:
                commands.append(f"ls -la {path}")
                commands.append(f"find {path} -type f -name \"*.php\" | head -20")
                commands.append(f"find {path} -type f -name \"*.conf\" -o -name \"*.ini\" | head -20")

    # Interesting files to check
    if "Interesting Files" in interesting_data:
        for file_path in interesting_data["Interesting Files"]:
            commands.append(f"cat {file_path}")
            parent_dir = os.path.dirname(file_path)
            if parent_dir:
                commands.append(f"ls -la {parent_dir}")

    # Database checks
    if "Database" in interesting_data:
        commands.append("grep -r \"DB_\" /var/www/ --include=\"*.php\" --include=\"*.ini\" --include=\"*.conf\"")
        commands.append("find /var/www/ -name \"*.sql\" -o -name \"*.db\" -o -name \"*.sqlite\"")

    # Web server checks
    if "Server" in interesting_data:
        if "DOCUMENT_ROOT" in interesting_data["Server"]:
            doc_root = interesting_data["Server"]["DOCUMENT_ROOT"]
            commands.append(f"ls -la {doc_root}")
            commands.append(f"find {doc_root} -type f -name \"*.php\" | grep -i admin")
            commands.append(f"find {doc_root} -type f -name \"*.php\" | grep -i login")
            commands.append(f"find {doc_root} -type f -name \"*.php\" | grep -i config")

    return commands

def display_interesting_data(interesting_data):
    """Display interesting data in a readable format"""
    print(f"\n{Colors.GREEN}{Colors.BOLD}[+] INTERESTING INFORMATION FROM PHPINFO:{Colors.ENDC}\n")

    for category, data in interesting_data.items():
        print(f"{Colors.CYAN}{Colors.BOLD}{category}:{Colors.ENDC}")

        if isinstance(data, dict):
            for key, value in data.items():
                # Highlight potentially sensitive information
                if any(s in key.lower() for s in ['password', 'secret', 'key', 'token', 'api', 'credential']):
                    print(f"  {Colors.RED}{key}:{Colors.ENDC} {value}")
                else:
                    print(f"  {Colors.YELLOW}{key}:{Colors.ENDC} {value}")
        elif isinstance(data, list):
            for item in data:
                print(f"  - {item}")

        print("")

def suggest_next_steps(interesting_data, url):
    """Suggest next steps for further exploration"""
    print(f"\n{Colors.GREEN}{Colors.BOLD}[+] SUGGESTED NEXT STEPS:{Colors.ENDC}\n")

    # Generate commands
    commands = generate_commands(interesting_data)

    # URL parsing for suggestions
    parsed_url = urlparse(url)
    base_url = f"{parsed_url.scheme}://{parsed_url.netloc}"

    print(f"{Colors.YELLOW}1. Further URL Exploration:{Colors.ENDC}")
    print(f"   - Check for common PHP applications at: {base_url}/")
    print(f"   - Look for admin interfaces: {base_url}/admin/, {base_url}/administrator/, {base_url}/wp-admin/")
    print(f"   - Check for other info disclosure: {base_url}/info.php, {base_url}/server-status, {base_url}/server-info")

    if "Paths" in interesting_data:
        print(f"\n{Colors.YELLOW}2. File System Exploration:{Colors.ENDC}")
        print("   Based on the paths found, you might want to check:")
        for key, path in list(interesting_data["Paths"].items())[:5]:
            print(f"   - {path}")

    print(f"\n{Colors.YELLOW}3. Useful Commands:{Colors.ENDC}")
    for i, cmd in enumerate(commands[:10]):
        print(f"   {i+1}. {cmd}")

    if len(commands) > 10:
        print(f"   ... and {len(commands) - 10} more commands")

    if "Database" in interesting_data:
        print(f"\n{Colors.YELLOW}4. Database Investigation:{Colors.ENDC}")
        print("   Look for database connection strings in configuration files")

    print(f"\n{Colors.YELLOW}5. Further Scanning:{Colors.ENDC}")
    print(f"   - Run directory brute force: gobuster dir -u {base_url} -w /usr/share/SecLists/Discovery/Web-Content/raft-medium-directories.txt")
    print(f"   - Scan for vulnerabilities: nikto -h {base_url}")
    print(f"   - Check for PHP vulnerabilities: wpscan --url {base_url} (if WordPress)")

def main():
    print_banner()

    parser = argparse.ArgumentParser(description="PHPInfo Grabber - Extract and analyze system information from phpinfo pages")
    parser.add_argument("url", help="URL to the phpinfo page (e.g., https://example.com/phpinfo.php)")
    parser.add_argument("-o", "--output", default="phpinfo_output", help="Output directory for extracted data")
    parser.add_argument("-t", "--timeout", type=int, default=10, help="Request timeout in seconds")
    parser.add_argument("--no-verify-ssl", action="store_false", dest="verify_ssl", help="Disable SSL certificate verification")
    parser.add_argument("-u", "--user-agent", help="Custom User-Agent string")
    parser.add_argument("-p", "--proxy", help="Proxy URL (e.g., http://127.0.0.1:8080)")

    args = parser.parse_args()

    # Fetch phpinfo page
    html_content = get_phpinfo(
        args.url, 
        timeout=args.timeout, 
        verify_ssl=args.verify_ssl,
        user_agent=args.user_agent,
        proxy=args.proxy
    )

    if not html_content:
        print(f"{Colors.RED}[!] Failed to retrieve phpinfo page. Exiting.{Colors.ENDC}")
        return

    # Parse phpinfo
    phpinfo_data = parse_phpinfo(html_content)

    if not phpinfo_data:
        print(f"{Colors.RED}[!] Failed to parse phpinfo data. Exiting.{Colors.ENDC}")
        return

    print(f"{Colors.GREEN}[+] Successfully parsed phpinfo data with {len(phpinfo_data)} sections{Colors.ENDC}")

    # Extract interesting information
    interesting_data = extract_interesting_data(phpinfo_data)
    print(f"{Colors.GREEN}[+] Extracted interesting information from phpinfo data{Colors.ENDC}")

    # Display interesting data
    display_interesting_data(interesting_data)

    # Export data
    json_file, interesting_json, summary_file = export_data(phpinfo_data, interesting_data, args.output)

    # Suggest next steps
    suggest_next_steps(interesting_data, args.url)

    print(f"\n{Colors.GREEN}[+] Analysis complete! See {args.output} directory for full results.{Colors.ENDC}")

if __name__ == "__main__":
    try:
        main()
    except KeyboardInterrupt:
        print(f"\n{Colors.YELLOW}[!] Process interrupted by user{Colors.ENDC}")
        exit(0)
    except Exception as e:
        print(f"{Colors.RED}[!] An error occurred: {str(e)}{Colors.ENDC}")
        exit(1)

The tool fetches the phpinfo page, parses its content, extracts key information, and presents the findings in a readable format with color-coded output. It also generates various export files and suggests next steps for further investigation.
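
The remaining command-line flags defined in the script can be combined as needed; the target URLs and proxy address below are placeholders:

# Write results to a custom directory, extend the timeout, and skip SSL verification
python phpinfo_grabber.py https://dev.example.com/info.php -o dev_phpinfo -t 20 --no-verify-ssl

# Route the request through an intercepting proxy (address taken from the script's own help text)
python phpinfo_grabber.py https://example.com/phpinfo.php -p http://127.0.0.1:8080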

URL Categorizer: Making Sense of Web Reconnaissance

During security assessments or when exploring websites, you often end up with large lists of URLs from crawlers, directory brute-forcing tools, or web application scanners. Manually sorting through these can be time-consuming and error-prone.

Our URL Categorizer tool solves this problem by automatically organizing URLs into meaningful categories, identifying potential security issues, and suggesting targeted next steps.

Key Features

1. Intelligent Classification

The URL Categorizer can automatically classify URLs into numerous categories, including:

  • File types (PHP, JavaScript, CSS, images, documents)
  • CMS-specific resources (WordPress, Joomla, Drupal)
  • Administrative interfaces
  • API endpoints
  • Sensitive files (configuration files, backups, databases)
  • And many more

2. Security Analysis

Beyond simple categorization, the tool performs security analysis to identify:

  • Exposed sensitive files (phpinfo, configuration files)
  • Backup files that might contain source code
  • Database files accessible via the web
  • Administrative interfaces that should be secured
  • Version information that could reveal vulnerabilities

3. Actionable Recommendations

Based on the categorized URLs and security analysis, the tool suggests next steps, such as:

  • Running specific security tools based on detected technologies
  • Examining potentially sensitive files
  • Testing discovered administrative interfaces
  • Parameter fuzzing on PHP files

4. Customizable Classification

The tool allows you to define custom patterns for categorization, making it adaptable to your specific needs and target environments.
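
For instance, the -p flag implemented in the script below takes a category name and a regular expression, so a hypothetical versioned API path could get its own bucket:

# Hypothetical custom category: group anything under a versioned API path
python url_categorizer.py discovered_urls.txt -p api_v2 '/api/v2/'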

Usage Example

Here is the full source of url_categorizer.py:

#!/usr/bin/env python3

import re
import os
import argparse
from collections import defaultdict

# ANSI colors for better readability
class Colors:
    GREEN = '\033[92m'
    YELLOW = '\033[93m'
    RED = '\033[91m'
    BLUE = '\033[94m'
    MAGENTA = '\033[95m'
    CYAN = '\033[96m'
    ENDC = '\033[0m'
    BOLD = '\033[1m'

def print_banner():
    banner = f"""{Colors.BLUE}{Colors.BOLD}
╔═══════════════════════════════════════════════════╗
║              URL CATEGORIZER TOOL                 ║
║     Organize and classify discovered URLs         ║
╚═══════════════════════════════════════════════════╝
{Colors.ENDC}"""
    print(banner)

def categorize_urls(urls_file, custom_patterns=None):
    """
    Categorize URLs from a file based on extensions and path patterns.

    Args:
        urls_file (str): Path to the file containing URLs (one per line)
        custom_patterns (dict, optional): Custom regex patterns for additional categories

    Returns:
        dict: Categories with their respective URLs
    """
    categories = defaultdict(list)

    # Check if file exists
    if not os.path.isfile(urls_file):
        print(f"{Colors.RED}[!] Error: File '{urls_file}' not found{Colors.ENDC}")
        return categories

    # Read URLs from file
    try:
        with open(urls_file, 'r') as f:
            urls = [line.strip() for line in f if line.strip()]
        print(f"{Colors.GREEN}[+] Successfully loaded {len(urls)} URLs from {urls_file}{Colors.ENDC}")
    except Exception as e:
        print(f"{Colors.RED}[!] Error reading file: {str(e)}{Colors.ENDC}")
        return categories

    # Define default patterns
    patterns = {
        'php_files': r'\.php(\?.*)?$',
        'javascript_files': r'\.js(\?.*)?$',
        'css_files': r'\.css(\?.*)?$',
        'images': r'\.(png|gif|jpg|jpeg|svg|ico|webp)(\?.*)?$',
        'documents': r'\.(pdf|doc|docx|xls|xlsx|ppt|pptx|txt|rtf|csv|xml|json)(\?.*)?$',
        'archives': r'\.(zip|rar|tar|gz|7z)(\?.*)?$',
        'audio_video': r'\.(mp3|mp4|avi|mov|wmv|flv|wav|ogg)(\?.*)?$',
        'api_endpoints': r'/(api|rest|graphql|wp-json)/',
        'admin_resources': r'/(admin|administrator|wp-admin|dashboard|control|cp)/',
        'login_pages': r'/(login|signin|log-in|sign-in|auth|authenticate)\.php',
        'backup_files': r'\.(bak|backup|old|temp|tmp)$',
        'config_files': r'/(config|configuration|settings|setup|install)\.php',
        'databases': r'\.(sql|sqlite|db)$',
        'sensitive_files': r'/(phpinfo|info)\.php',
        'hidden_files': r'/\.[^/]+$',  # Files starting with dot
    }

    # WordPress specific patterns
    wp_patterns = {
        'wp_includes': r'/wp-includes/',
        'wp_content': r'/wp-content/',
        'wp_plugins': r'/wp-content/plugins/',
        'wp_themes': r'/wp-content/themes/',
        'wp_uploads': r'/wp-content/uploads/',
    }

    # Merge with custom patterns if provided
    if custom_patterns:
        patterns.update(custom_patterns)

    # Add WordPress patterns
    patterns.update(wp_patterns)

    # Add CMS detection patterns
    cms_patterns = {
        'wordpress': r'/(wp-|wordpress)',
        'joomla': r'/(joomla|administrator/index\.php)',
        'drupal': r'/(drupal|sites/default|misc/drupal\.js)',
        'magento': r'/(magento|skin/frontend|app/design/frontend)',
        'shopify': r'/(shopify|cdn\.shopify\.com)',
    }
    patterns.update(cms_patterns)

    # Process each URL
    for url in urls:
        categorized = False

        # Check against each pattern
        for category, pattern in patterns.items():
            if re.search(pattern, url, re.IGNORECASE):
                categories[category].append(url)
                categorized = True

        # If not categorized by any pattern, add to 'other'
        if not categorized:
            categories['other'].append(url)

    return categories

def print_and_save_categories(categories, output_file='categorized_urls.txt'):
    """
    Print categorized URLs to console and save them to a file

    Args:
        categories (dict): Categories with their respective URLs
        output_file (str): Path to output file
    """
    try:
        with open(output_file, 'w') as out:
            out.write("# CATEGORIZED URLS REPORT\n")
            out.write("=" * 50 + "\n\n")

            # Sort categories by number of URLs (descending)
            sorted_categories = sorted(categories.items(), 
                                      key=lambda x: len(x[1]), 
                                      reverse=True)

            # Print summary
            print(f"\n{Colors.GREEN}{Colors.BOLD}[+] URL CATEGORIZATION SUMMARY:{Colors.ENDC}")
            for category, urls in sorted_categories:
                if urls:  # Only print non-empty categories
                    print(f"{Colors.CYAN}   {category.replace('_', ' ').title()}: {Colors.YELLOW}{len(urls)}{Colors.ENDC}")

            # Print detailed results and write to file
            print(f"\n{Colors.GREEN}{Colors.BOLD}[+] DETAILED RESULTS:{Colors.ENDC}")
            for category, urls in sorted_categories:
                if urls:  # Only process non-empty categories
                    header = f"\n## {category.replace('_', ' ').title()} ({len(urls)})"
                    print(f"{Colors.MAGENTA}{header}{Colors.ENDC}")
                    out.write(header + '\n')

                    for url in urls:
                        print(f"- {url}")
                        out.write(f"- {url}\n")

        print(f"\n{Colors.GREEN}[+] Categorized URLs saved to '{output_file}'{Colors.ENDC}")
        return True
    except Exception as e:
        print(f"{Colors.RED}[!] Error saving categories: {str(e)}{Colors.ENDC}")
        return False

def analyze_interesting_findings(categories):
    """
    Analyze categorized URLs for interesting security findings

    Args:
        categories (dict): Categories with their respective URLs

    Returns:
        list: Interesting findings with descriptions
    """
    findings = []

    # Check for sensitive files
    if 'sensitive_files' in categories and categories['sensitive_files']:
        findings.append({
            'title': 'Sensitive Information Disclosure',
            'description': 'Found phpinfo or info files that may disclose sensitive server information',
            'urls': categories['sensitive_files'],
            'severity': 'High'
        })

    # Check for backup files
    if 'backup_files' in categories and categories['backup_files']:
        findings.append({
            'title': 'Backup Files Exposed',
            'description': 'Found backup files that might contain sensitive information or source code',
            'urls': categories['backup_files'],
            'severity': 'Medium'
        })

    # Check for config files
    if 'config_files' in categories and categories['config_files']:
        findings.append({
            'title': 'Configuration Files Exposed',
            'description': 'Found configuration files that might contain database credentials or other sensitive information',
            'urls': categories['config_files'],
            'severity': 'High'
        })

    # Check for database files
    if 'databases' in categories and categories['databases']:
        findings.append({
            'title': 'Database Files Exposed',
            'description': 'Found database files that might be downloadable and contain sensitive data',
            'urls': categories['databases'],
            'severity': 'Critical'
        })

    # Check for admin interfaces
    if 'admin_resources' in categories and categories['admin_resources']:
        findings.append({
            'title': 'Admin Interfaces Discovered',
            'description': 'Found admin interfaces that should be properly secured',
            'urls': categories['admin_resources'][:5] + (['...'] if len(categories['admin_resources']) > 5 else []),
            'severity': 'Medium'
        })

    # WordPress version detection
    wp_includes = categories.get('wp_includes', [])
    if wp_includes:
        # Look for version in readme.html or other version indicators
        version_files = [url for url in wp_includes if 'version' in url.lower()]
        if version_files:
            findings.append({
                'title': 'WordPress Version Information',
                'description': 'Found files that may reveal WordPress version information',
                'urls': version_files,
                'severity': 'Low'
            })

    return findings

def suggest_next_steps(categories):
    """
    Suggest next steps based on categorized URLs

    Args:
        categories (dict): Categories with their respective URLs

    Returns:
        list: Suggested next steps
    """
    suggestions = []

    # WordPress specific suggestions
    if any(key in categories for key in ['wp_includes', 'wp_content', 'wp_plugins']):
        suggestions.append("Run WPScan to identify WordPress vulnerabilities: wpscan --url [target]")
        suggestions.append("Check exposed WordPress plugins for known vulnerabilities")

    # If sensitive files found
    if 'sensitive_files' in categories and categories['sensitive_files']:
        suggestions.append("Examine phpinfo files for sensitive information using PHPInfo Grabber")

    # If admin resources found
    if 'admin_resources' in categories and categories['admin_resources']:
        suggestions.append("Check admin interfaces for weak credentials or authentication bypass vulnerabilities")

    # If backup or config files found
    if ('backup_files' in categories and categories['backup_files']) or \
       ('config_files' in categories and categories['config_files']):
        suggestions.append("Download and analyze backup/config files for sensitive information")

    # General suggestions
    suggestions.append("Run directory brute-forcing with additional wordlists to discover more resources")
    suggestions.append("Perform parameter fuzzing on discovered PHP files to identify potential vulnerabilities")

    return suggestions

def main():
    print_banner()

    parser = argparse.ArgumentParser(description="URL Categorizer - Organize and classify discovered URLs")
    parser.add_argument("urls_file", help="File containing URLs (one per line)")
    parser.add_argument("-o", "--output", default="categorized_urls.txt", help="Output file for categorized URLs")
    parser.add_argument("-a", "--analysis", action="store_true", help="Perform security analysis on categorized URLs")
    parser.add_argument("-p", "--pattern", action='append', nargs=2, metavar=('CATEGORY', 'REGEX'), 
                        help="Add custom pattern: -p category_name 'regex_pattern'")

    args = parser.parse_args()

    # Process custom patterns if provided
    custom_patterns = {}
    if args.pattern:
        for category, pattern in args.pattern:
            custom_patterns[category] = pattern

    # Categorize URLs
    categories = categorize_urls(args.urls_file, custom_patterns)

    if not categories:
        print(f"{Colors.RED}[!] No URLs categorized. Exiting.{Colors.ENDC}")
        return

    # Print and save categorized URLs
    print_and_save_categories(categories, args.output)

    # Perform security analysis if requested
    if args.analysis:
        print(f"\n{Colors.GREEN}{Colors.BOLD}[+] SECURITY ANALYSIS:{Colors.ENDC}\n")

        findings = analyze_interesting_findings(categories)

        if findings:
            for finding in findings:
                print(f"{Colors.RED if finding['severity'] == 'Critical' else Colors.YELLOW}" + 
                      f"[{finding['severity']}] {finding['title']}{Colors.ENDC}")
                print(f"   {finding['description']}")
                print("   URLs:")
                for url in finding['urls']:
                    print(f"   - {url}")
                print("")
        else:
            print(f"{Colors.YELLOW}[!] No significant security findings detected.{Colors.ENDC}")

        # Suggest next steps
        print(f"\n{Colors.GREEN}{Colors.BOLD}[+] SUGGESTED NEXT STEPS:{Colors.ENDC}\n")
        suggestions = suggest_next_steps(categories)
        for i, suggestion in enumerate(suggestions, 1):
            print(f"{Colors.CYAN}{i}. {suggestion}{Colors.ENDC}")

    print(f"\n{Colors.GREEN}[+] URL categorization complete! See {args.output} for full results.{Colors.ENDC}")

if __name__ == "__main__":
    try:
        main()
    except KeyboardInterrupt:
        print(f"\n{Colors.YELLOW}[!] Process interrupted by user{Colors.ENDC}")
        exit(0)
    except Exception as e:
        print(f"{Colors.RED}[!] An error occurred: {str(e)}{Colors.ENDC}")
        exit(1)

Using the URL Categorizer is straightforward:

python url_categorizer.py discovered_urls.txt -a

This command categorizes all URLs from the file and performs security analysis, providing a clear overview of the website structure and potential security issues.
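
Because the script is plain Python behind an __main__ guard, its functions can also be reused as a library. A minimal sketch, assuming the file is saved as url_categorizer.py on your import path (the extra api_v2 category is hypothetical):

# Reuse the categorizer and analysis functions from the script above programmatically.
from url_categorizer import categorize_urls, analyze_interesting_findings

categories = categorize_urls("discovered_urls.txt",
                             custom_patterns={"api_v2": r"/api/v2/"})
for finding in analyze_interesting_findings(categories):
    print(f"[{finding['severity']}] {finding['title']} ({len(finding['urls'])} URLs)")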

Integration and Workflow

What makes the infoanalyzer suite particularly powerful is how the tools work together. Here’s a typical workflow:

  1. Reconnaissance Phase:
     • Use web crawlers or directory brute-forcing tools to discover URLs
     • Run url_categorizer.py to organize and classify the discovered URLs
     • Identify sensitive pages and potential entry points

  2. Information Gathering Phase:
     • For any phpinfo pages found, use phpinfo_grabber.py to extract and analyze system information
     • Follow the suggested next steps from both tools

  3. Analysis Phase:
     • Review the exported data and findings
     • Correlate information between tools
     • Develop targeted strategies based on discovered information

This integrated approach ensures you don’t miss critical information and provides a systematic methodology for web application reconnaissance.
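
Sketched on the command line, the same workflow looks roughly like this (file names and the target URL are placeholders):

# 1. Discover URLs with a crawler or brute-forcer of your choice and save them one per line
#    (gobuster and similar tools are suggested by the scripts themselves)

# 2. Categorize the discovered URLs and run the security analysis
python url_categorizer.py discovered_urls.txt -a

# 3. Feed any phpinfo pages it flags into PHPInfo Grabber
python phpinfo_grabber.py https://dev.example.com/info.php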

Real-World Case Study

To illustrate the power of infoanalyzer, let me share a recent security assessment I conducted for a mid-sized e-commerce company.

Initial Reconnaissance

The assessment began with standard reconnaissance techniques, resulting in a list of over 1,000 URLs. Rather than manually reviewing each URL, I used the URL Categorizer:

python url_categorizer.py discovered_urls.txt -a

The tool quickly organized the URLs into categories, revealing:

  • 427 PHP files
  • 156 JavaScript files
  • 87 WordPress plugin files
  • 3 potentially sensitive configuration files
  • 1 phpinfo page on a development subdomain

The security analysis highlighted several concerning findings, including exposed backup files and administrative interfaces.

Deeper Investigation

With the phpinfo page identified, I used PHPInfo Grabber to extract detailed system information:

python phpinfo_grabber.py https://dev.example.com/info.php

The tool revealed critical information:

  • PHP configuration with allow_url_include enabled (a significant security risk)
  • Database connection details exposed in environment variables
  • Sensitive file paths that weren’t directly accessible via the web
  • Outdated PHP extensions with known vulnerabilities

Findings and Impact

By combining the insights from both tools, I was able to:

  1. Identify an SQL injection vulnerability in an admin page discovered by URL Categorizer
  2. Access database backups using path information from PHPInfo Grabber
  3. Exploit the allow_url_include vulnerability to achieve remote code execution
  4. Discover hardcoded API credentials in configuration files

The client was impressed with how quickly and systematically these vulnerabilities were identified. The structured output from the infoanalyzer tools made documentation straightforward, and the clear categorization helped prioritize remediation efforts.

Technical Implementation

Both tools are written in Python and share a similar design philosophy:

  1. Modular Structure: Each tool is divided into discrete functions for fetching, parsing, analyzing, and reporting.
  2. Progressive Enhancement: The tools attempt multiple strategies when processing data, from standard parsing to regex-based fallbacks.
  3. Rich Output: Color-coded terminal output makes it easy to identify important information.
  4. Multiple Export Formats: Data is exported in various formats (JSON, CSV, TXT) for further analysis (see the sketch after this list).
  5. Actionable Intelligence: Each tool goes beyond raw data to provide insights and next steps.
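
Because PHPInfo Grabber writes its full dataset to a timestamped JSON file, downstream analysis is easy to script. A minimal sketch, assuming the default phpinfo_output directory and at least one prior run:

import json
from pathlib import Path

# Load the most recent full export produced by phpinfo_grabber.py
# (directory and filename pattern follow the export_data() function shown earlier).
output_dir = Path("phpinfo_output")
latest = max(output_dir.glob("phpinfo_full_*.json"), key=lambda p: p.stat().st_mtime)

with latest.open(encoding="utf-8") as f:
    data = json.load(f)

# Example query: report every section that mentions disable_functions
for section, entries in data.items():
    if "disable_functions" in entries:
        print(section, "->", entries["disable_functions"])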

PHPInfo Grabber Code Highlights

The PHPInfo Grabber uses BeautifulSoup for HTML parsing and implements multiple strategies to handle different phpinfo() layouts:

def parse_phpinfo(html_content):
    """Parse phpinfo HTML and extract key-value pairs"""
    soup = BeautifulSoup(html_content, 'html.parser')

    # Different phpinfo styles and structures
    # Try to detect the phpinfo style
    phpinfo_data = {}

    # Try to find all tables
    tables = soup.find_all('table')

    if tables:
        # Standard table-based parsing
        ...
    else:
        # Alternative approaches for non-table structures (divs, then a regex fallback)
        ...

URL Categorizer Code Highlights

The URL Categorizer uses regular expressions to classify URLs based on patterns:

def categorize_urls(urls_file, custom_patterns=None):
    """Categorize URLs from a file based on extensions and path patterns"""
    categories = defaultdict(list)

    # Define default patterns
    patterns = {
        'php_files': r'\.php(\?.*)?$',
        'javascript_files': r'\.js(\?.*)?$',
        # Many more patterns...
    }

    # Process each URL (urls is the list read from urls_file earlier in the full function)
    for url in urls:
        for category, pattern in patterns.items():
            if re.search(pattern, url, re.IGNORECASE):
                categories[category].append(url)
                # ...

The Road Ahead

infoanalyzer is just getting started. We have plans to expand the toolkit with additional tools:

  • ServerInfo Grabber: For Apache/Nginx server-status and server-info pages
  • WordPressInfo Grabber: For WordPress configuration and plugin analysis
  • Headers Analyzer: For analyzing HTTP response headers and security configurations
  • Content Fingerprinter: For identifying technologies and frameworks based on content patterns
  • SSL/TLS Analyzer: For evaluating SSL/TLS configurations and identifying weaknesses

Our vision is a comprehensive suite of tools that together provide deep insight into web application environments, each tool extracting a specific type of information while sharing a consistent interface and data format for easy integration.

Conclusion

infoanalyzer represents our commitment to building powerful, user-friendly tools for information gathering and analysis. By transforming raw data into structured, actionable intelligence, we hope to help security professionals work more efficiently and effectively.

The combination of PHPInfo Grabber and URL Categorizer provides a solid foundation for systematic web reconnaissance, and we’re excited to see how the community uses and extends these tools. Whether you’re conducting security assessments, managing web applications, or learning about web security, infoanalyzer can help you understand your digital landscape more clearly.

Download infoanalyzer today and start uncovering the information hiding in plain sight on your web servers!


Tags: Security, PHP, Web Reconnaissance, Information Gathering, Penetration Testing, Open Source

