Problem of the Day
A new programming or logic puzzle every Mon-Fri

Subreddit Count

Problem of the Day has a large overlap with the Reddit community. Let's put that overlap to use and create a program that counts the total number of subreddits. You can use the https://www.reddit.com/dev/api#GET_subreddits_new API to traverse Reddit's full list of subreddits.
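
The listing endpoint is paginated, so the basic approach is to request a page, read its "after" cursor, and keep going until the cursor comes back empty. Here is a minimal sketch along those lines (the count_subreddits name, the 100-item page size, and the 2-second pause are illustrative choices, not part of the problem; a descriptive User-Agent helps avoid extra rate-limiting):

    import time

    import requests


    # Minimal sketch: follow the "after" cursor through /subreddits/new.json
    # until the API stops returning a next page.
    def count_subreddits(user_agent='problem-of-the-day counter'):
        count = 0
        after = None
        while True:
            params = {'limit': 100}
            if after:
                params['after'] = after
            resp = requests.get('https://www.reddit.com/subreddits/new.json',
                                params=params,
                                headers={'User-Agent': user_agent})
            data = resp.json()['data']
            count += len(data['children'])
            after = data['after']
            if not after:
                return count
            time.sleep(2)  # stay well under the API rate limit


    if __name__ == '__main__':
        print(count_subreddits())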

Permalink: http://problemotd.com/problem/subreddit-count/

Comments:

  • Kevin Benton - 9 years, 3 months ago

    This takes forever to complete because of API rate-limiting. Here is a Python solution that uses multiple coroutines to open several connections at once, but the server is still the limiting factor here:

    import eventlet
    import json
    import random
    import requests
    
    
    # use different front-ends (found from dns lookup)
    # so crawlers aren't hitting the same server all
    # of the time.
    reddit_addresses = [
        '198.41.209.140',
        '198.41.208.141',
        '198.41.209.141',
        '198.41.208.142',
        '198.41.208.140',
        '198.41.209.139',
        '198.41.209.138',
        '198.41.209.142',
        '198.41.209.136',
        '198.41.208.139',
        '198.41.208.138',
        '198.41.208.143',
        '198.41.209.143',
        '198.41.209.137',
        '198.41.208.137'
    ]
    
    
    # fetch one page of the subreddit listing, retrying on request or
    # JSON failures; returns the listing's 'data' dict or None
    def get_next_data(after, sort_order='new', retries=10):
        url = 'http://%s/subreddits/%s.json?limit=100' % (
            random.choice(reddit_addresses), sort_order
        )
        if after:
            url += '&after=%s' % after
        try:
            r = requests.get(url, headers={'User-Agent': '%030x' % random.randrange(16**30),
                                           'Host': 'www.reddit.com'})
            d = json.loads(r.text)['data']
        except:
            try:
                print 'json loading of following failed: %s' % r.text
            except:
                print 'failed to connect to %s' % url
            if not retries:
                return None
            return get_next_data(after, sort_order, retries-1)
        return d
    
    
    # walk the listing from 'after', adding display names to the shared set
    # until a page contains nothing new or the listing runs out
    def crawl_until_duplicate_or_end(subreddits, after):
        while True:
            d = get_next_data(after)
            after = d['after']
            names = set(map(lambda x: x['data']['display_name'], d['children']))
            print names
            # keep going if any members are unique
            if names.issubset(subreddits):
                return names
            subreddits.update(names)
            print "new set size %s" % len(subreddits)
            if not after:
                return names
    
    
    def multicrawl_names(names):
        # use eventlet to bound number of crawlers
        pool = eventlet.greenpool.GreenPool(size=len(reddit_addresses))
        results = []
        for n in names:
            results.append(pool.spawn(crawl_until_duplicate_or_end, subreddits, n))
        return map(lambda r: r.wait(), results)
    
    
    if __name__ == '__main__':
        # seed workers using popular subreddits. then work by age from each one
        # until it hits a point where another has already crawled
    
        top = get_next_data(None, sort_order='popular')
        top_names = map(lambda x: x['data']['name'], top['children'])
        top_names.append('programming')
        subreddits = set()
        ends = multicrawl_names(top_names)
        sanity_count = 0
        # everything should be crawled, but retry the ends again to verify
        while sanity_count != len(subreddits):
            sanity_count = len(subreddits)
            map(multicrawl_names, ends)
        print "found %s total subreddits" % len(subreddits)
    

  • Max Burstein - 9 years, 3 months ago

    Yeah, since I don't think you can guess the hash, it makes it tough to do in parallel.

  • Kevin Benton - 9 years, 3 months ago

    Forgot to include the estimate this stopped at (or got blocked at): 510149 subreddits.

    Seems kinda low based on the ages of the ones that were going by in the output around the 300k mark.

Content curated by @MaxBurstein