Problem of the Day
A new programming or logic puzzle every Mon-Fri

Subreddit Count

Problem of the Day has a large overlap with the Reddit community. Let's put that overlap to use and create a program that counts the total number of subreddits. You can use the https://www.reddit.com/dev/api#GET_subreddits_new API to traverse Reddit's full list of subreddits.
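
The listing endpoint is paginated, so the basic approach is to request a page, read its "after" cursor, and keep going until the cursor comes back empty. Here is a minimal sketch along those lines (the count_subreddits name, the 100-item page size, and the 2-second pause are illustrative choices, not part of the problem; a descriptive User-Agent helps avoid extra rate-limiting):

    import time

    import requests


    # Minimal sketch: follow the "after" cursor through /subreddits/new.json
    # until the API stops returning a next page.
    def count_subreddits(user_agent='problem-of-the-day counter'):
        count = 0
        after = None
        while True:
            params = {'limit': 100}
            if after:
                params['after'] = after
            resp = requests.get('https://www.reddit.com/subreddits/new.json',
                                params=params,
                                headers={'User-Agent': user_agent})
            data = resp.json()['data']
            count += len(data['children'])
            after = data['after']
            if not after:
                return count
            time.sleep(2)  # stay well under the API rate limit


    if __name__ == '__main__':
        print(count_subreddits())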

Permalink: http://problemotd.com/problem/subreddit-count/

Comments:

  • Kevin Benton - 9 years, 3 months ago

    This takes forever to complete because of API rate-limiting. Here is a Python solution that uses multiple coroutines to open several connections at once, but the server is still the limiting factor here:

    import eventlet
    import json
    import random
    import requests
    
    
    # use different front-ends (found from dns lookup)
    # so crawlers aren't hitting the same server all
    # of the time.
    reddit_addresses = [
        '198.41.209.140',
        '198.41.208.141',
        '198.41.209.141',
        '198.41.208.142',
        '198.41.208.140',
        '198.41.209.139',
        '198.41.209.138',
        '198.41.209.142',
        '198.41.209.136',
        '198.41.208.139',
        '198.41.208.138',
        '198.41.208.143',
        '198.41.209.143',
        '198.41.209.137',
        '198.41.208.137'
    ]
    
    
    # fetch one page of the subreddit listing, retrying on request or
    # JSON failures; returns the listing's 'data' dict or None
    def get_next_data(after, sort_order='new', retries=10):
        url = 'http://%s/subreddits/%s.json?limit=100' % (
            random.choice(reddit_addresses), sort_order
        )
        if after:
            url += '&after=%s' % after
        try:
            r = requests.get(url, headers={'User-Agent': '%030x' % random.randrange(16**30),
                                           'Host': 'www.reddit.com'})
            d = json.loads(r.text)['data']
        except:
            try:
                print 'json loading of following failed: %s' % r.text
            except:
                print 'failed to connect to %s' % url
            if not retries:
                return None
            return get_next_data(after, sort_order, retries-1)
        return d
    
    
    # walk the listing from 'after', adding display names to the shared set
    # until a page contains nothing new or the listing runs out
    def crawl_until_duplicate_or_end(subreddits, after):
        while True:
            d = get_next_data(after)
            after = d['after']
            names = set(map(lambda x: x['data']['display_name'], d['children']))
            print names
            # keep going if any members are unique
            if names.issubset(subreddits):
                return names
            subreddits.update(names)
            print "new set size %s" % len(subreddits)
            if not after:
                return names
    
    
    def multicrawl_names(names):
        # use eventlet to bound number of crawlers
        pool = eventlet.greenpool.GreenPool(size=len(reddit_addresses))
        results = []
        for n in names:
            results.append(pool.spawn(crawl_until_duplicate_or_end, subreddits, n))
        return map(lambda r: r.wait(), results)
    
    
    if __name__ == '__main__':
        # seed workers using popular subreddits. then work by age from each one
        # until it hits a point where another has already crawled
    
        top = get_next_data(None, sort_order='popular')
        top_names = map(lambda x: x['data']['name'], top['children'])
        top_names.append('programming')
        subreddits = set()
        ends = multicrawl_names(top_names)
        sanity_count = 0
        # everything should be crawled, but retry the ends again to verify
        while sanity_count != len(subreddits):
            sanity_count = len(subreddits)
            map(multicrawl_names, ends)
        print "found %s total subreddits" % len(subreddits)
    

  • Max Burstein - 9 years, 3 months ago

    Yeah, since I don't think you can guess the hash, it makes it tough to do in parallel.

  • Kevin Benton - 9 years, 3 months ago

    Forgot to include the estimate this stopped at (or got blocked at): 510149 subreddits.

    Seems kinda low based on the ages of the ones that were going by in the output around the 300k mark.

Content curated by @MaxBurstein