Bot – Master Coder X

I have been working on a project for which I wanted to build a simple web crawler that grabs the subdomains of a given parent domain. This turns out to be pretty simple: query DuckDuckGo and parse the returned HTML with BeautifulSoup.

Using Google can be a pain because most of its results page is rendered by JavaScript, which makes it difficult to parse. DuckDuckGo makes this much easier by offering an HTML-only subdomain (html.duckduckgo.com) that returns mostly static HTML that can be parsed easily.

Warning: If you run the crawler too many times in a short period, DuckDuckGo will put a 15 to 30 minute block on your IP address. The solution is to sleep between calls or to use a sufficient number of proxy servers.
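
As a rough illustration of the "sleep between calls" approach, here is a minimal sketch. The helper name run_throttled, its parameters, and the 5 second default are my own assumptions rather than anything DuckDuckGo documents, so tune the delay to whatever keeps you from being blocked.

```python
import time


def run_throttled(search, queries, delay_seconds=5.0, sleep=time.sleep):
    """Call search(query) for each query, pausing between consecutive calls.

    `search` is any function that takes a query string and returns results.
    `sleep` is injectable so the pause can be replaced in tests.
    """
    results = {}
    for index, query in enumerate(queries):
        # Pause before every call except the first one.
        if index > 0:
            sleep(delay_seconds)
        results[query] = search(query)
    return results
```

With the crawler from this post, a call could look like run_throttled(duckduckgo_search_results, ["site:google.com", "site:example.com"]).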

The Code

import requests
import bs4

def duckduckgo_search_results(q):
    # Make the bot look like it's using a browser.
    headers = {
        'User-Agent': "Mozilla/5.0"
    }

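    # Search DuckDuckGo through its HTML-only endpoint
    url = "https://html.duckduckgo.com/html/"

    # Grab the HTTP Response object. Passing the query via params lets
    # requests URL-encode it, and the timeout stops the call from hanging.
    resp = requests.get(url, params={"q": q}, headers=headers, timeout=10)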

    pages = []

    # If the status code is 200, then continue
    if resp.status_code == 200:
        # Parse the html
        soup = bs4.BeautifulSoup(resp.text, "html.parser")

        # Grab links from result__url class
        result_url_tags = soup.find_all(class_="result__url")

        # Grab the text from each tag, trimming surrounding whitespace.
        for tag in result_url_tags:
            pages.append(tag.get_text(strip=True))
    
    return pages


def main():
    pages = duckduckgo_search_results("site:google.com")
    print("Page Count: " + str(len(pages)))

if __name__ == "__main__":
    main()

You’ll notice that the only header set is the User-Agent, which makes the bot appear as a browser.

Once the requests module returns the response, BeautifulSoup's html.parser is used to parse the HTML sent back by DuckDuckGo.

Afterwards, it is easy to extract the URLs from the search results, since each one is wrapped in an element with the CSS class result__url.
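
The crawler above stops at collecting the result text. To get from there to the subdomains mentioned at the start, something like the following sketch might work. The function extract_subdomains is a hypothetical helper of mine, and it assumes each result string is a URL or a bare host/path such as mail.google.com/mail, which may not hold for every result.

```python
from urllib.parse import urlparse


def extract_subdomains(pages, parent_domain):
    """Return the sorted, unique hostnames in pages at or under parent_domain."""
    parent = parent_domain.lower().strip(".")
    found = set()
    for page in pages:
        text = page.strip()
        if not text:
            continue
        # Result text may lack a scheme; prefixing "//" makes urlparse
        # treat the leading part as the host instead of a path.
        if "://" not in text:
            text = "//" + text
        host = urlparse(text).hostname  # lowercased, without any port
        # Keep the parent itself and anything ending in ".<parent>".
        if host and (host == parent or host.endswith("." + parent)):
            found.add(host)
    return sorted(found)
```

For example, extract_subdomains(duckduckgo_search_results("site:google.com"), "google.com") would reduce the page list to the distinct hosts it contains.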

Final Words

In this blog post, you learned how to crawl for web pages using DuckDuckGo and a bot written in Python.

Category: Bot

Building a DuckDuckGo Web Crawling Bot in Python
