Visualising data: Search Terms

Sat, 21 Aug 2010 08:33PM

Category: Digital Forensics & Malware

Written by Sarah | No comments

I've finally finished the first draft of my thesis. I now have a week and a few days to edit and finish it, which is plenty of time since I'm fairly happy with it as it stands.

Another of Webscavator's visualisations is a word cloud for search engine query terms. The more often a term has been searched for, the larger it appears in the word cloud. Screenshot 1 shows an example word cloud. Words are clickable: clicking one pops up a box with more details, as in Screenshot 2.


Screenshot 1: The search term word cloud


Screenshot 2: The term 'python' has been clicked on, which shows a pop-up box with more details.

The following code extracts search terms from a URL.

import urllib  # Python 2; in Python 3, unquote lives in urllib.parse

def getTerms(query):
    """
        Given the query part of a URL, extracts the search terms or phrases 
        (those surrounded by quotes). Returns the search string and a list of 
        terms in that string.
    """
    current_terms = []
    terms = query.split('"')
    actual_query = u" ".join(query.split('+'))
    
    if len(terms) == 1: # no quotes
        for x in terms[0].split('+'):
            x = x.strip().lower() # normalise before comparing with STOP_WORDS
            # STOP_WORDS is a list of (lowercase) words to ignore, e.g. "the" & "and"
            if x and x not in SearchTerms.STOP_WORDS:
                current_terms.append(x)
    else:
        for t in terms:
            if t != '' and (t[0] == "+" or t[-1] == "+"): # unquoted words
                for x in t.split('+'):
                    x = x.strip().lower()
                    if x and x not in SearchTerms.STOP_WORDS: # skip empty strings too
                        current_terms.append(x)
            elif t != '': # quoted phrase: keep it as a single term
                current_terms.append(' '.join(t.split('+')).lower())
    
    return actual_query, current_terms 
    
# url is assumed to be an already parsed URL with a .query attribute
query = url.query.split('q=')[-1].split('&')[0] # find the search terms
q_string, terms = getTerms(urllib.unquote(query))
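For comparison, here is a rough Python 3 sketch of the same idea using only the standard library. It is not Webscavator's code: the function name extract_terms, the contents of STOP_WORDS and the example URL are all made up for illustration. Note that parse_qs already turns '+' into spaces, so this version splits on whitespace rather than on '+'.

```python
import re
from urllib.parse import urlparse, parse_qs

# Assumption: an illustrative stop word list, not the one Webscavator uses.
STOP_WORDS = {"the", "and", "a", "of"}

def extract_terms(url, param="q"):
    """Return (search string, list of terms) for the given URL.

    Quoted phrases are kept as single terms; unquoted words are
    lowercased and dropped if they are stop words.
    """
    values = parse_qs(urlparse(url).query).get(param, [])
    if not values:
        return "", []
    search = values[0]  # parse_qs has already decoded %22 and '+'
    terms = []
    # Each match is either a quoted phrase (group 1) or a bare word (group 2).
    for phrase, word in re.findall(r'"([^"]+)"|(\S+)', search):
        if phrase:
            terms.append(" ".join(phrase.lower().split()))
        else:
            w = word.strip('"').lower()
            if w and w not in STOP_WORDS:
                terms.append(w)
    return search, terms

example = "http://www.google.com/search?q=python+%22web+framework%22+the+tutorial&hl=en"
q_string, terms = extract_terms(example)
print(q_string)  # python "web framework" the tutorial
print(terms)     # ['python', 'web framework', 'tutorial']
```

Using parse_qs avoids splitting the query string by hand on 'q=' and '&', at the cost of assuming the search engine puts its terms in a parameter called q.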

These search terms are then stored in a separate table in the database to make querying more efficient. Webscavator calls an AJAX method that returns all the search terms for the word cloud as a dictionary (see my post on JavaScript perils), along with some other useful variables. Each value in the word dictionary is a tuple containing the word's ratio (the number of times the word appears divided by the total number of appearances of all words) and a list of the phrases the word appears in. To keep the font sizes in the word cloud readable, the AJAX response also includes the smallest ratio. The word with that ratio is always drawn at font size 11 so that it is visible, and every other word is sized relative to it.
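To make the shape of that dictionary concrete, here is a minimal Python 3 sketch of how it could be built. This is a guess at the structure described above, not the actual server-side code: the function name build_word_cloud and the input format (a list of search string and terms pairs) are assumptions.

```python
from collections import Counter, defaultdict

def build_word_cloud(searches):
    """Build {word: (ratio, phrases)} plus the smallest ratio.

    searches is assumed to be a list of (search_string, terms) pairs.
    """
    counts = Counter()
    phrases = defaultdict(list)
    for search_string, terms in searches:
        for term in terms:
            counts[term] += 1
            if search_string not in phrases[term]:
                phrases[term].append(search_string)

    total = sum(counts.values())  # all appearances of all words
    words = {term: (n / total, phrases[term]) for term, n in counts.items()}
    smallest = min((ratio for ratio, _ in words.values()), default=0.0)
    return words, smallest

searches = [
    ("python tutorial", ["python", "tutorial"]),
    ("python web", ["python", "web"]),
]
words, smallest = build_word_cloud(searches)
print(words["python"])  # (0.5, ['python tutorial', 'python web'])
print(smallest)         # 0.25
```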

$.getJSON('${urls.build("visual.jsonGetWordClouds")|h}', filters, function (obj) {  
    $.each(obj, function(i, val){  // obj is returned from the AJAX call
        var words = val[0]; // words is the word cloud dictionary
        var search_engine = val[1]; // which search engine this is, e.g. 'yahoo'
        var factor = 11/val[2]; // what to multiply each ratio by to get the font size, 
                                // with the smallest always being size 11
        
        var search_div = $('.in_panes div:last').append($('<div/>').addClass(search_engine));
        search_div.append($('<div/>').addClass('wordcloud'));
        
        $.each(words, function(term, details){  // for each word in the word cloud

            // make the word a random shade of grey (d2h is a decimal-to-hex helper, not shown)
            var hex = ('0' + d2h(Math.floor(Math.random() * 200))).substr(-2);   
            var w = $('<div/>').addClass('word').html(term).css('color', '#'+hex+hex+hex);
            $('.wordcloud:last').append(w);

            var fontsize = factor*details[0];
            if (fontsize > 300){
                fontsize = 300; // maximum font size; anything above this is just too big
            }
            $('.word:last').css('font-size', fontsize); // change the font size
        });   
    });
});

The code above has been simplified quite a bit so that the general idea is not lost among the formatting code. I have left out how I do the pop-ups, but they are easily done with jQuery Tools.
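The font size arithmetic is easier to check away from the jQuery. Here is a small Python 3 sketch of the same scaling rule: the smallest ratio maps to size 11, everything else scales linearly, and sizes are capped at 300. The function name font_sizes and the example ratios are invented for illustration.

```python
MIN_FONT_SIZE = 11   # size given to the word with the smallest ratio
MAX_FONT_SIZE = 300  # cap, as in the JavaScript above

def font_sizes(ratios):
    """Map {word: ratio} to {word: font size}; ratios must be positive."""
    if not ratios:
        return {}
    factor = MIN_FONT_SIZE / min(ratios.values())
    return {word: min(factor * ratio, MAX_FONT_SIZE)
            for word, ratio in ratios.items()}

print(font_sizes({"python": 0.5, "web": 0.25, "forensics": 0.125}))
# {'python': 44.0, 'web': 22.0, 'forensics': 11.0}

print(font_sizes({"python": 0.5, "rare": 0.015625}))
# {'python': 300, 'rare': 11.0}
```

In the second example the factor is 11 / 0.015625 = 704, so 'python' would come out at 352 and is clamped to 300.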
