Skip to content
October 16, 2011 / codehost

Relational Algebra

With the new academic year underway I might have a lot of stuff to write about. The most interesting and exotic thing I found was in the Introduction to Databases class from Stanford. Of all the subjects out there, what on earth could be so exotic about databases? Well, the title of the post already gave it away.

 

So what is it?

Relational algebra is, well, an algebra. I would normally avoid math like the plague, but this one is simple and fun. It’s relational because it operates on relations. Relations are better known as tables. Tables are collections of tuples, also known as rows. Compared to SQL, tables in relational algebra are sets, so there are no duplicate tuples.

The schema of a database is represented by the names of the relations and the names of their columns. For example, let us make 3 relations named Student, University and Apply. Then lets define the columns for each:

  • Student(sid, sname, gpa), where
    • sid is a student ID
    • sname is the student’s name
    • gpa is the GPA of that student
  • University(uname, city, enr), where
    • uname is the name of the University
    • city is where it’s situated
    • enr is the number of enrolled students
  • Apply(sid, uname, major) defines where students have applied. sid and uname have the same meaning as in the previous relations. major is the major the student has applied for.

Actual data stored under the schema of a database is called an instance.

 

Operators

Operators in relational algebra are functions that take relations and filter conditions and return new relations.

Relation name
The most basic operator out there is the relation name itself. For example:

Student

will return a table like this:

sid sname gpa
123 Zaharia 9.3
234 Gigel 9.7
345 Delia 9.5

Select
To filter a relation the select operator is used. The operator symbol is sigma \sigma . For example, lets take only the students with GPA > 9.4:

\sigma_{gpa > 9.4} Student

Projection
The projection operator is used to make a table with a subset of a relation’s columns. The symbol is \pi :

\pi_{sid, sname} Student

resulting in:

sid sname
123 Zaharia
234 Gigel
345 Delia

Cross product
Since relations are sets, operations on sets are also valid. The most useful one is the cross product:

Student \times Apply

will make all possible combinations between students and applications. Because it’s not very useful data, a filter operator could be used to select only those resulting tuples that have the same student ID and a projection to keep only one sid entry. This will practically join the 2 relations indicating the student info and where he has applied:

\pi_{Student.sid, sname, gpa, uname, major} (\sigma_{Student.sid = Apply.sid} (Student \times Apply))

Natural join
In the previous example we had to apply select, differentiate the sid from the 2 relations and apply a projection. There is a shorthand operator that does just that. It’s called natural join and is represented with a bow tie operator \bowtie . That ugly and long formula is equivalent to:

Student \bowtie Apply

Union, set minus and intersection
The rest of the set operators work as you’d expect. Keep in mind that the result will not have duplicate values.

Rename
Sometimes the relations have to be renamed, that is the column names have to be changed. The operator is \rho and it takes a relation and a new schema for it. A curious example is the natural join of the same 2 relations. For instance, to take the pairs of Universities in the same city:

(\rho_{u1(n1, c, e1)}University) \bowtie (\rho_{u2(n2, c, e2)}University)

Cool, isn’t it? One of the problems would be that the results will include combinations of, for example, Politehnica – Politehnica. A select could be applied to have only results where n1 \neq n2 :

\sigma_{n1 \neq n2}((\rho_{u1(n1, c, e1)}University) \bowtie (\rho_{u2(n2, c, e2)}University))

This one will then have pairs with Politehnica – ASE and ASE – Politehnica. To avoid that, a simple solution is to select alphabetically with n1 < n2 :

\sigma_{n1 < n2}((\rho_{u1(n1, c, e1)}University) \bowtie (\rho_{u2(n2, c, e2)}University))

 

Conclusion

I find it nice that a rather boring subject has, in fact, a neat theoretical foundation. All structured query languages like MySQL, Oracle or PostgreSQL are based on relational algebra. I read a forum once where students from my uni complained that a part of the database course deals with relational algebra and it’s useless in the real world. I guess it’s useless for most people, but not for those that design new languages or want a deeper understanding of them. For more info, watch the course online.

October 5, 2011 / codehost

Facebook Authentication

I got to the point of having to write my own code for Facebook authentication. I’m trying, for a significant amount of time now, to make a Facebook App written in Node.js. There are some modules out there that claim to do this. They were close, but no cigar.

 

Overview

Facebook uses some kind of combination of the OAuth, OAuth2 protocol and some stuff I don’t really know the origin of. Understanding the server side flow is easy and fun. There is a dedicated page that explains it better than this post.

Imagine you’re a Facebook App. Better still, imagine you’re the server that does all the work. The HTML pages you send will be inside what they call a Canvas. It’s a fancy name for an iframe. All you need to get user data using the Facebook API is a key named access_token. There are several ways to obtain it, but first…

 

So a user decides to visit

So a user decides to visit your app. He’s not logged in. This means Facebook would not send any user data whatsoever. In this case, you can just greet the anonymous buddy with a sad message informing him that he cannot pass. The best thing to do though is redirect him to the login page. The beauty of the login page is that you can tell Facebook to send him back to you after entering his credentials. Here’s the vanilla login dialog page: https://www.facebook.com/dialog/oauth.

You’re not gonna get far just with it. Facebook is so cool, that they’ll greet you with a helpful message describing the problem. Well, not really, but you need to add several GET parameters to avoid that error:

  • client_id The application ID
  • redirect_uri Where you want your user to go after this (read “the URL of your app”). I have an app set up for all the testing needs at http://apps.facebook.com/nodejs-tests (sometimes it’s https, I mess with it a lot)

The best way to redirect is by executing a client-side script to make everything seamless. So let’s create the correct login URL (it’s good practice to encode it, but works like this, too):

https://www.facebook.com/dialog/oauth?client_id=127778710657519&redirect_uri=http://apps.facebook.com/nodejs-tests

 

The user is logged in, now what?

An interesting fact is that the user’s browser will never directly send a request to the server. Facebook will intercept everything and then make a POST request to the hosting server.

The signed request
This request will most likely contain a signed_request in the data section. This is what it looks like: vlXgu64BQGFSQrY0ZcJBZASMvYvTHu9GQ0YM9rjPSso.eyJhbGdvcml0aG0iOiJITUFDLV
NIQTI1NiIsIjAiOiJwYXlsb2FkIn0

The above request isn’t valid – it would be too long. With a real one the trick is to decode the part after the dot from base64 and a neat JSON string will be revealed. For example:

{
  "algorithm": "HMAC-SHA256",
  "expires": 1317765600,
  "issued_at": 1317760848,
  "oauth_token": "X",
  "user": {
    "country": "ro",
    "locale": "en_US",
    "age": {
      "min": 21
    }
  },
  "user_id": "658951951"
}

That’s my user data. The user_id is real, but the oauth_token has been replaced with a nice sequence of Xs. That access token that I’ve hidden would empower an app to get whatever information about me from the Graph and FQL and is what we’ve been looking for.

A good app must check the validity of the data it received by computing the SHA256 hmac of the JSON string using the Application Secret – the private key known by the app developers. It must equal the base64 decoded string preceding the dot in the signed_request.

The authorization code
The method described above is probably the best way to obtain the access token. That being said, sometimes Facebook is a jerk and won’t give you a signed_request. The user will go to the login dialog and if he’s already logged in, he’ll be unnoticeably redirected to your redirect_uri with code as one of the parameters. That’s the authorization code.

The access token can be retrieved from the Graph at this address:

https://graph.facebook.com/oauth/access_token?client_id=YOUR_APP_ID&redirect_uri=YOUR_URL&client_secret=YOUR_APP_SECRET&code=THE_CODE_FROM_ABOVE

That’s a request sent by the server, preferably asynchronously. The link is almost real, containing the actual app data for the first 3 parameters and an authorization code that I invalidated for this post (by replacing with Xs).

A correct request will result in a textual response (or a JSON error message) like this one, containing the access_token and the expiration time in seconds:

access_token=X&expires=6455

 

Conclusion

You can make all sorts of data queries using the Facebook API and a valid access token. There are some neat tools that make the Graph and FQL exploration a breeze. First is the Graph API Explorer and then the FQL query page.

If you want to use a Node.js wrapper for the stuff I talked about here (the wrapper is in fact better than what I described) go to my Github repo or install using npm like this:

npm install facebook-wrapper

September 26, 2011 / codehost

We live in a fictional universe

I’m fascinated by how we’re born with a clean mental slate, what they call in pretentious Latin a “tabula rasa”. Whatever we acquire shapes our world view and carves the path towards more knowledge and interests. My life, for instance, felt like a journey where the world stays physically the same, but the way I interpret and see it keeps changing.

 

What we do with the little we have

Think back to your early childhood. Do you remember the way you imagined how things worked? When I was in kindergarten, whenever someone mentioned “the human organism” I imagined a mechanical system with rotating gears inside us. I knew it didn’t make any sense, but there was simply no other point of reference – that was all I knew, so I had to learn more about it. When I learned that the Earth was round – my first ever thoughts about the shape of our planet, in fact – I didn’t question it. After all, the Moon and Sun are round, so it wasn’t so far fetched. But I was wondering what would happen if someone fell off? Where would he end up, would he float endlessly? I couldn’t grasp the concept of infinity or that of gravity.
gears

 

Growing up

Did anything change all these years? Well, we certainly have a more detailed, a more precise view of the Universe, but in reality we’re still like some kids, unable to understand some basic principles and inventing all sorts of wacky concepts or even deities to fill the void. I’m talking about things like the particle physics. Pity I’m not smart enough to understand it, but I have some general concepts picked up from my freshman university courses.

One thing I know is that the subatomic particles are already so small that the only way to observe them is by throwing other subatomic particles at them. This will inadvertently change the position and velocity of the particle we are trying to observe so the results will not be accurate. If we want less deviation the particle thrown must have less frequency/energy, but this leads to more approximate readings, like a lower “resolution”. Increase the energy and you have more precise readings but increased uncertainty. This is kind of the Heisenberg’s Uncertainty Principle, explained in everyday words.

uncertainty

It makes sense in layman’s terms, but scientists stopped describing physics using words since, as far as I know, the 17th century. An advanced mathematical apparatus evolved to describe quantum physics using complicated statistics, matrices, the relativity theory and other gibberish. It’s so advanced that only a handful of people on the planet can grasp it and when they try to explain it back to the world, humanity ends up with something as God-awful as The Schrödinger’s cat.

 

Schrödinger made this up as a thought experiment to prove a point, but what I see is people taking this example as proof that quantum physics is something mystical and extremely thought provoking because, you see, when a state in our macrocosm depends on a particle with theoretically 2 contradicting states at the same time, it makes our heads explode in smug satisfaction for penetrating the deep secrets of our half-assed imagination. More like:

lolcat

 

Just my $0.02

No, imagining what the string theory explains to us, again in ordinary words we can understand, that the particles are little vibrating strings attached to a ginormous brane, and when 2 branes collide, a Big Bang occurs… This sounds just as ridiculous as a digestive system composed of metallic cogs and gears. I’ve grown to accept that this is as far as I can go to building a consistent, logical world view. It isn’t very logical and just raises further questions.

Why have I been ranting about this, you ask? Well it’s this book I read, about primitive societies (traditional societies, as the politically correct authors call them). It’s mostly dull, but there was this chapter that told how, at first, people refused to believe they are utterly gone after death. Their relatives stayed with them as spirits. As more generations passed, it is possible that such a large number of spirits (both good and evil) were revered, that many natural phenomena were explained as their actions. As time passed, the world grew more and more magical, in the eyes of the prehistoric people. The fact that they lived in relatively small areas didn’t help. The lands beyond the horizon would be called “the land of the dead”, like something out of bounds and incomprehensible.

Basically, we project our knowledge to create an imaginary world, a gross approximation, like a wrapper to the physical reality, so we can meaningfully interact with it. Putting myself in their place, imagining such a world view, was fascinating. Nowadays we think we’ve got it all right, but in reality we still know very little. To paraphrase what H. P. Lovecraft said, who knows what terrifying truths await us.

 

Conclusion

Anyways, I wrote this post, completely unrelated to the computer sciency tone of the blog, so that maybe I could read it in a couple of years and see what changed, how I would react to my past, more ignorant self.

September 19, 2011 / codehost

Interpreting Brainfuck

Brainfuck is an esoteric programming language. It was made to be extremely minimalist yet Turing complete. That means it’s possible to code solutions for any algorithmic problem just like in most other languages. The problem is it’s also a Turing tarpit. Now this shows that computational power doesn’t mean expressive power. Programmers have a hard time comprehending even the simplest of operations, like assigning a constant.

Anyway. Why I’m bringing this up is because a student organization at my University called ROSEdu organizes periodic courses called CDL – Open Development Course. ROSEdu is made up of talented hackers and CDL is like a bootcamp. The main goal of this organization is to promote the Open Source Community to our students. You can check out their new blog – ROSEdu Techblog.

Before I stray too much from the subject, CDL has limited spots so the candidates have to submit a solution to some of the proposed problems, found here (it’s in Romanian). I checked some of them out and, after working on the tar archiver (almost making it work), I bumped into problem 5: A Brainfuck interpreter. My reaction was “What!? A compiler? For freshmen!”. I don’t intend to apply, I already attended one of these courses more than a year ago. Besides, I’d be too old. I was doing this for fun though, but an interpreter sounded like way too much.

 

The language

I’ve heard about Brainfuck before and saw bits of code. My eyes bled. I thought the one who unleashed this plague of a language should be cast out of geekdom. What I didn’t know is that the only easy thing to do with BF is writing a compiler for it. Let me summarize the syntax.

The flow is somewhat similar to assembly. There is a data buffer, practically and array initialized with 0. All operations are done in a cell from this buffer. The number of instructions is very small, only 8 of them:

  • > – move the data pointer to the right in the data buffer. The equivalent of p++ from C
  • < – move the pointer to the left. Like p--
  • + – increment the value pointed in the data array by 1. As in *p++
  • - – decrease the value by 1. Similar to *p--
  • . – print the ASCII character (represented as a number) pointed in the buffer. printf("%c", *p)
  • , – read a character from the input and store in the pointed cell. scanf("%u", p) or more like the good ol’ (int) getch()
  • [ – a conditional/loop operator. If the pointed value is 0, step over the entire block between this character and the matching ]. Something in the lines of while(*p) {
  • ] – if the value is not null, jump to the instruction right after the matching [ bracket. In C it would match the previous instruction with }

 

An interpreter

I figured writing your own interpreter is trivial. Rightly so, because there are some implementations out there in less than 200 bytes. There’s practically 0 algorithmic contribution for a simple version. Of course, there are probably lots of improvements and shortcuts to make it fast, but I won’t bother. I wrote my own version in Python, partly to exercise this language (didn’t have much contact with it yet).

My version will have a circular data buffer with at most 5000 elements. It will be represented by a bytearray. The elements are bytes. Since Python doesn’t offer 8 bit data types (or I don’t know of them), all value operations will be modulo 256. Well, the 2 of them that is.

Since the open function will give a buffer and by default it loads the file data in memory, I’ll keep the instructions in this buffer and iterate with its cursor. The standard input and output will be used directly.

As a slight optimization, the program will save the position of the next character after each [ and put it in a stack. This way when ] is encountered a direct jump to the correct position can be done in O(1).

This is the resulting small source code:

#!/usr/bin/python
import sys, os

if len(sys.argv) != 2 or not os.path.isfile(sys.argv[1]):
	print 'Usage: %s valid_file' % (sys.argv[0])
	sys.exit()
buffLen = 5000 # data buffer size
maxNum = 256 # operating with bytes
data = bytearray(buffLen) # data buffer initialized with 0
p = 0 # data pointer
jump = [] # last character after a '[': for faster loops
f = open(sys.argv[1], 'r') # code buffer
c = f.read(1) # current character

while len(c):
	if c == '<':
		p = (p - 1) % buffLen
	elif c == '>':
		p = (p + 1) % buffLen
	elif c == '+':
		data[p] = (data[p] + 1) % maxNum
	elif c == '-':
		data[p] = (data[p] - 1) % maxNum
	elif c == '.':
		sys.stdout.write(chr(data[p]))
	elif c == ',':
		text = sys.stdin.read(1)
		if len(text) == 1:
			data[p] = ord(text)
	elif c == '[':
		if data[p] == 0:
			while c != ']': c = f.read(1)
		else:
			jump.append(f.tell())
	elif c == ']' and len(jump):
		if data[p]:
			f.seek(jump[-1])
		else:
			jump.pop()
	c = f.read(1)

 

Testing

The first program to test is, of course, the classic “Hello World!”:

 ++++++++++[>+++++++>++++++++++>+++>+<<<<-]>++.>+.+++++++..+++.>++.<<+++++++++++++++.>.+++.------.--------.>+.>.

 

Unfortunately it makes little sense. It just does lots of adding and subtracting to match the ASCII characters in the “Hello World!” string. Extremely convoluted. The source code with comments, if you can call them that, is on Wikipedia. Some basic tests can be done with snippets from here. They seem to mostly pass, except the bounds checking because the data buffer is circular. I don’t know why, but most of the examples there just don’t do what they say. I suspect it’s the interpreter. Maybe it’s missing something. I would really like you to help me by pointing out where I did wrong.

 

Conclusion

Come to think of it: this is my first compiler. Well, the most simple one in the world, but still. Also, it seems like today I finally wrote a smaller post. At least that’s how it felt. I hope you enjoyed it. Don’t hesitate to suggest improvements or corrections.

September 13, 2011 / codehost

Fuzzy substring matching

Recently I saw the movie Commando for the first time. A very funny movie, even though it’s not a comedy. It’s a gold mine for Arnie one-liners, my favorite kind of one-liners. I practically found the source of the jokes about macho action films with all the cliches and inconsistencies.

Meanwhile in the world of tech stuff, I cooked up a search function for the Friend Sieve app. At first I thought I’d add some existing search code/framework, like the one Google offers. The nature of the site’s dynamic data forced me to implement my own tools for this. I also wanted to preserve the ordering and ranking of each friend, even with a partial list of name matches.

 

Version 1
I wanted an easy solution, just to see how everything worked. The ordinary substring search algorithms, even the optimized ones like KMP, aren’t good for human users. What if someone wants to search a friend, but doesn’t know his exact name, or the name isn’t ordinary or includes non-ASCII characters that are a pain to type?

I’m not much into algorithms, but I knew of some out there that make partial matches. For example, I implemented a levenshtein algorithm for the edit distance in a homework last semester. Besides, PHP already has a levenshtein function. I wanted the edit distance for a substring, not the entire text, and there weren’t any parameters in the built-in function that could help.

I ended up using metaphone. The function takes a string and returns how it will be pronounced. In my case, “Vlad Bagrin” will yield “FLTBKRN” and “Vlad” or “Vld” will result in “FLT”. By metaphoning the name and pattern and searching with substr, I got fuzzy string matching cheaply.

Now that I solved the problem, I could move on, because everything was OK, right? Wrong! The algorithm still missed close matches that contained special Romanian characters. Metaphone in that form was only good for English spelling. Some of the matches it found were out of place. They made me laugh. If Matrix was here, he’d laugh too.

 

Version 2
Remember when I said I knew the levenshtein algorithm? I lied! I had pretty much forgotten how it works and was unable to infer any special properties that could help. What did I do with metaphone? I let it go. To replace it, I tried finding a regular expression. I failed. Then maybe a regex to match all substrings between the first and last characters in the pattern, like /v.*d/i for “vlad”, then find the one with the smallest edit distance. That regex fails to find all matches because the * wildcard is greedy. Making it lazy didn’t help either.

All the searching through the Internet and books almost got me dead tired. I knew there was something simple out there, I just wasn’t smart enough to figure it out or at least google it properly. Then, finally, I stumbled on this short, clear and helpful blog post that delivered me from ignorance. The solution was so simple and obvious, I’m surprised I didn’t think of it. Well, actually, I’m not surprised at all…

Basically, with a small change in the levenshtein algorithm, you can find the edit distance for a partial match, not the entire text. Jeez, I finally let off some steam! In the good, not the Bennett, sense.

 

Levenshtein
The vanilla algorithm compares two strings s and t and returns the number of edits performed on s to become t. The allowed edits are insertion, deletion and substitution of a character. It’a an example of dynamic programming, as all the prefixes of s and t will be tested, bottom up, resulting in a matrix of m * n edit distances, where m and n are the lengths of s and t respectively.

 

Let s = "abac" and t = "abba". The resulting matrix will be:

a b b a
0 1 2 3 4
a 1 0 1 2 3
b 2 1 0 1 2
a 3 2 1 1 1
c 4 3 2 2 2

 

To clarify it, let’s look at the table and see how many edits we have to make for different prefix combinations.

  • s = “” and t = “a”: 1 edit, that is – adding ‘a’ to s
  • s = “” and t = “ab”: 2 edits, add ‘a’ then ‘b’ to s
  • s = “aba” and t = “”: 3 edits, remove ‘a’, ‘b’ and ‘a’ from s. That’s how the first line and column are initialized. Because the other prefix is empty, you have to add the other string entirely.
  • s = “ab” and t = “abb”: 1 edit, since adding ‘b’ to s is enough

 

For the more complex cases, let’s look at the last line, second column: s = “abac” and t = “a”. Because the table cell matches ‘c’ from s and ‘a’ from t and they differ, the options we have are:

  1. Add ‘a’ at the end of s, solving our current problem, then check how many edits are needed to change the remaining “abac” from s into what’s left of t – well, nothing. This means a cost of 1 for addition and 4 for the rest of the edits, taken from the memoized data. This equates to 1 + d[i][j - 1], where d is the matrix, i is the current line and j is the current column.
  2. Remove ‘c’ from s and deal with s = “aba” and t = “a”. This is equal to 1 + d[i - 1][j]
  3. Replace ‘c’ with ‘a’ in s and find the edits for s = “aba” and t = “”. The same as 1 + d[i - 1][j - 1]

These were the cases when the current characters under scrutiny differ. We check the 3 options and choose the cheapest one, with the minimum number of edits.

If, on the other hand, the characters are equal, then the cost is 0 and all we have to do is find the edits for the smaller prefixes, that is d[i - 1][j - 1]. In the end, the bottom right cell of the matrix will have the minimum edit distance.

 

Pseudocode
Armed with an intuitive understanding of the steps, the code is pretty straightforward:

levenshtein(s, m, t, n):
    d = matrix(m + 1, n + 1)
    d[0] = range(0, n)
    for i in range(1, m):
        d[i][0] = i
        for j in range(1, n):
            if s[i - 1] == t[j - 1]:
                d[i][j] = d[i - 1][j - 1]
            else:
                d[i][j] = min(
                    d[i][j - 1],
                    d[i - 1][j],
                    d[i - 1][j - 1]
                ) + 1
    return d[m][n]

 

Despite PHP having a built in levenshtein that would probably be light years faster than a hand written one, I’ll post an example here, then work out the necessary adaptations into it. For PHP I won’t use a matrix, because it’s so uncomfortable operating with them in this language that I’d rather code with my feet. Instead, I’ll use the observation that we don’t really need the whole matrix to just compute the number of edits – only the current and previous rows. I’ll add the values to the current row, sometimes using the previous one, then swap them in the next iteration.

function my_levenshtein($s, $t) {
	$n = strlen($t);
	$m = strlen($s);
	$curr_row = range(0, $n); // d[i]
	$prev_row = array_fill(0, $n + 1, 0); // d[i - 1]
	
	for ($i = 1; $i <= $m; $i++) {
		$tmp =& $curr_row; // swap the rows => i++
		$curr_row =& $prev_row;
		$prev_row =& $tmp;

		$curr_row[0] = $i;
		for ($j = 1; $j <= $n; $j++) {
			if ($s[$i - 1] == $t[$j - 1]) {
				$curr_row[$j] = $prev_row[$j - 1];
			} else {
				$curr_row[$j] = min(
					$curr_row[$j - 1], // d[i][j - 1]
					$prev_row[$j - 1], // d[i - 1][j - 1]
					$prev_row[$j] // d[i - 1][j]
				) + 1;
			}
		}
	}

	return $curr_row[$n];
}

 

Adapting to our needs
As the helpful blog explained, all you need to do to find the number of edits of a substring match is:

  1. Initialize the first row of the matrix with 0. That’s like saying that the number of edits before starting to match s is 0. We just don’t care.
  2. Likewise, the number of edits done after we stopped matching s is irrelevant to us so we might as well take the minimum value in the last row.

For s = "sa" and t = "tara", the resulting matrix is:

t a r a
0 0 0 0 0
s 1 1 1 1 1
a 2 2 1 2 1

Resulting in just 1 edit for a match of “sa” to “ta” or “ra”.

 

Before listing the PHP code, I’d like to point out that we might be interested in a match only if the number of edits doesn’t exceed a specified value. That value could be 25% of the pattern length, for instance. This way I filter out the friend names that don’t match at least 75% of the pattern anywhere in the string.

Now, imagine that we have a max_edits number that shouldn’t be exceeded. During row iteration, the current row could produce values that are all > max_edits. This could be the last row and in this case the minimum number of edits was too large. If this wasn’t the last row, it will become the previous row in the next iteration. In that case, even if you match all the characters and don’t have to increase the number of edits, you’ll have to add values from a row where all numbers are already too large. This means that the function can safely return a negative result, meaning no matches were found.

The PHP function will store the minimum number from the current row and return false if it’s too large. If all iterations were passed successfully, then the last minimum value was good enough and the function returns with true.

function levenshtein_substr($s, $t, $max_edits) {
	$n = strlen($t);
	$m = strlen($s);
	$curr_row = array_fill(0, $n + 1, 0); // d[i]
	$prev_row = array_fill(0, $n + 1, 0); // d[i - 1]
	
	for ($i = 1; $i <= $m; $i++) {
		$tmp =& $curr_row; // swap the rows => i++
		$curr_row =& $prev_row;
		$prev_row =& $tmp;

		$minim = $curr_row[0] = $i;
		for ($j = 1; $j <= $n; $j++) {
			if ($s[$i - 1] == $t[$j - 1]) {
				$value = $prev_row[$j - 1];
			} else {
				$value = min(
					$curr_row[$j - 1], // d[i][j - 1]
					$prev_row[$j - 1], // d[i - 1][j - 1]
					$prev_row[$j] // d[i - 1][j]
				) + 1;
			}
			if ($value < $minim) {
				$minim = $value;
			}
			$curr_row[$j] = $value;
		}

		// Check if the previous row will exceed max_edits
		if ($minim > $max_edits) {
			return false;
		}
	}

	return true;
}

 
Conclusion

This is how searching is done now in the app. If you want to test it, go to Friend Sieve. Don’t forget to recommend it and post to your wall. Thanks for your attention… I realize this post is too long, like the previous ones, but I couldn’t help it. Also, I couldn’t resist acting out how this algorithm works on the site:

Pattern: You scared, motherfucker? Well, you should be, because this pattern is going to kick your big ass!
App: I eat patterns for breakfast. And right now, I’m very hungry!

Sorry, I’ll stop infiltrating one-liners into my posts…

September 4, 2011 / codehost

Cliques and Social Networks

I’m developing a Facebook Application[1]. It sorts and lists your friends by how much they interacted with you. The app is almost complete, although it’s not what I had in mind initially. There are still several problems, mainly because the Facebook API hates me.

I also checked out other apps that deal with the social graph and categorizing your friends. The coolest one is Social Graph[2]. It shows a nice Flash animation of your social graph. Another thing it does is clustering the friends with similar connections and encircling them. This gave me the idea of investigating the cliques that a user’s mutual friends might form.

Maximal cliques

The problem of finding and listing the maximal cliques of an undirected graph is NP Complete. The most widely implemented solution for it is the Bron-Kerbosch algorithm. It runs in linear time relative to the number of cliques. Unfortunately, the number of cliques in a graph can be very large. An undirected graph with n vertices can have up to 3^{\frac{n}{3}} maximal cliques. That means even the best algorithms out there have exponential complexity.

Description
You can read a presentation of the Bron-Kerbosch algorithm[3] or a more in depth description[4]. The basic idea is building the cliques by backtracking. 3 sets are considered: R , P and X . R is the clique under construction. P contains all the vertices connected to the one previously inserted in R . X are all the nodes that have already been processed. For each recursive call, vertices v \in P are selected and new sets R \cup \{v\} , P \cap N(v) and X \cap N(v) are formed for the next level of the call tree, where N(v) are the neighbors of v . Then v is removed from P and added to X . The stop condition is to have both P and X empty.

Code
Let’s write a pseudocode for it, taken slightly modified from Wikipedia[5]:

function BK(R,P,X):
    if P and X are both empty:
        report R as a maximal clique
    for each vertex v in P:
        BK(R ⋃ {v}, P ⋂ N(v), X ⋂ N(v))
        P := P \ {v}
        X := X ⋃ {v}

 

Improvement
The algorithm can be improved with a pruning heuristic. [4] describes it well. Basically, we can choose a pivot u \in P and partition P in sets that contain the neighbors of u and those that don’t (including u ). If u will be placed in R , all of it’s neighbors will eventually reach R somewhere deeper in the call tree. There’s no need to explore permutations of the same clique. Ignoring all nodes connected to u will not change the results.

Some of the pivot choice heuristics include selecting one randomly or choosing the vertex with most connections (largest degree). I’ll stick with the degree method, as it makes more sense to me. The template algorithm with pivoting is:

function BK(R,P,X):
    if P and X are both empty:
        report R as a maximal clique
    choose a pivot vertex u in P
    for each vertex v in P:
        if v is not connected to u:
            BK(R ⋃ {v}, P ⋂ N(v), X ⋂ N(v))
            P := P \ {v}
            X := X ⋃ {v}

 

Implementation

I’ll implement that algorithm in PHP, because it will eventually end up in a Facebook App. The data type for the graph representation is an associative array where the key is the vertex ID – a number – and the value is an array of IDs – a neighbor list. In PHP all arrays are associative, so our task is somewhat simplified. I said somewhat because PHP is an inherently problematic language. Some basic actions are inexplicably not straightforward – like taking the first element of an array. Removing a known value from an array isn’t possible in linear time, so I wrote an unorthodox iteration. I could go the same route with pivot selection, but I decided not to bother.

function bron_kerbosch($graph, $r, $p, $x) {
	global $recursive_calls;
	$recursive_calls++;
	if (count($p) == 0 && count($x) == 0) {
		process_clique($r);
	} else {
		$pivot = select_pivot($graph, $p);
		$i = count($p);
		while ($i > 0) {
			$i--;
			$v = array_shift($p);
			if (!in_array($v, $graph[$pivot])) {
				bron_kerbosch($graph, array_merge($r, array($v)), array_intersect($p, $graph[$v]), array_intersect($x, $graph[$v]));
				array_push($x, $v);
			} else {
				array_push($p, $v);
			}
		}
	}
}

function select_pivot($graph, $p) {
	$pivot = -1;
	$max = -1;
	foreach ($p as $vertex) {
		$degree = count($graph[$vertex]);
		if ($degree > $max) {
			$max = $degree;
			$pivot = $vertex;
		}
	}
	return $pivot;
}

 

Facebook Integration

The documentation for the Facebook API[6] is cumbersome and talking about it in detail is unnecessary. Let’s just consider the magic function logged_in_check and the associated variables that will take care of setting up a wrapper for the API and prompting the user for required data. I’ll still list the code though:

// Facebook stuff
// App information
require_once('facebook/src/facebook.php');
header('Content-Type: text/html; charset=utf-8');
$app_secret = 'XXX';
$app_id = '133058413445569';
$app_addr = 'http://apps.facebook.com/sandbox_vladb/';

// Part of redirect script
$js = "<script type=\"text/javascript\">top.location.href =";

function logged_in_check() {
	global $js, $app_addr, $app_secret, $app_id;
	
	$fb = new Facebook(array('appId' => $app_id, 'secret' => $app_secret));
	$user = $fb->getUser();
	if (!$user) {
		$scope = '';
		$params = array('scope' => $scope, 'redirect_uri' => $app_addr);
		$login = $fb->getLoginUrl($params);
		$redirect_script = "$js \"$login\";</script>";
		echo $redirect_script;
		exit;
	} else {
		return $fb;
	}
}

Notice that the $app_addr is the URL where you’ll find the application. Also, I suppose the $app_secret should be kept a secret.

 

Requesting the friend list
Once a user is connected and logged in, the app proceeds to getting the user’s friend data. I implemented it with a FQL query, which is very much like stripped down SQL.

function get_friend_list($fb) {
	$list = array();

	$query = "select uid, name from user where uid in (select uid2 from friend where uid1=me())";
	$list = $fb->api(array('method' => 'fql.query', 'query' => $query));

	return $list;
}

 

Getting the mutual friends
The operation of requesting each mutual friends list is done in a rather similar manner. Here I use a FQL multiquery, which allows batching multiple queries in a single HTTP request. There is a limitation with the Facebook API where the maximum number of returned result entries is 5000, even for multiqueries. I’ll try to get around this limitation, although I couldn’t test it myself as my social network is not large enough to break stuff. The method used to get more than 5000 entries is by sending multiquery requests for a smaller number of friends – 200 for now – that’s my friends number and it didn’t cause any problems. The query list is broken up into several chunks of 200 each. The functions and helpers are listed here:

function get_mutual_friends($fb, $list) {
    $friend_ids = stringify_friend_list($list);
	$query = array();
    foreach ($list as $friend) {
		$id = $friend["uid"];
		$query[$id] = "select uid2 from friend where uid1=$id and uid2 in $friend_ids";
    }
	$queries = array_chunk($query, 200, true);
	$result = array();
	foreach ($queries as $chunk) {
		$result = array_merge($result, send_query($fb, $chunk));
	}
	return $result;
}

function stringify_friend_list($list) {
	$value = '(';
	foreach ($list as $friend) {
		$id = $friend["uid"];
		$value = $value . "$id,";
	}
	return substr($value, 0, -1) . ")";
}

function send_query($fb, $query) {
	$query = json_encode($query);
	
	$param = array(
		'method'   => 'fql.multiquery',
		'queries'  => $query,
		'callback' => ''
	);
	$result = $fb->api($param);
	
	return $result;
}

 

Processing the social graph
With the mutual friends data, the social graph can be constructed to resemble the one used by the Bron-Kerbosch implementation.

function get_social_graph($fb, $list) {
	$resource = get_mutual_friends($fb, $list);
	$graph = array();
	
	foreach ($resource as $entry) {
		$id = $entry["name"];
		$mutual_friends_list = $entry["fql_result_set"];
		$graph[$id] = array_map("process_result", $mutual_friends_list);
	}
	
	return $graph;
}

function process_result($entry) {
	return $entry["uid2"];
}

 

The “main” function
Now I’ll list the remaining unexplained functions and the execution flow, the one actually doing the work. You’ll notice I mark the time it took to run these actions, for benchmarking. For instance, the first version of the algorithm didn’t use a pivot – the clique listing took 20 seconds! With a pivot it only takes 500 milliseconds, for 200 friends.

function process_clique($clique) {
	global $names;
	echo "<li><ul>\n";
	foreach ($clique as $id) {
		echo "<li>" . $names[$id] . "</li>\n";
	}
	echo "</ul></li>";
}

// Access the friend's name by his ID
function assoc_id_name($list) {
	$result = array();
	foreach ($list as $entry) {
		$result[$entry["uid"]] = $entry["name"];
	}
	return $result;
}

function mtime() {
	return round(microtime(true) * 1000);
}

// Work is done here
$recursive_calls = 0;
$time_start = mtime();

$fb = logged_in_check();
$time_login = mtime();

$list = get_friend_list($fb);
$names = assoc_id_name($list);
$time_list = mtime();

$graph = get_social_graph($fb, $list);
$time_graph = mtime();

echo "<ol>\n";
bron_kerbosch($graph, array(), array_keys($graph), array());
echo "</ol>\n";
$time_algorithm = mtime();

echo "Number of recursive calls: $recursive_calls<br>\n";
echo "Time intervals:<ul>\n";
echo "<li>Log in: " . ($time_login - $time_start) . "ms</li>\n";
echo "<li>Friend list request: " . ($time_list - $time_login) . "ms</li>\n";
echo "<li>Social graph request: " . ($time_graph - $time_list) . "ms</li>\n";
echo "<li>Cliques listing: " . ($time_algorithm - $time_graph) . "ms</li></ul>\n";

 

Conclusion

I posted the source code on Github[7] for easier access. The application is up and running at the moment. You can test it here[8]. I hope you enjoyed the article. Have fun.

 
[1] http://apps.facebook.com/friend-sieve
[2] http://apps.facebook.com/socgraph
[3] http://www.dfki.de/~neumann/ie-seminar/presentations/finding_cliques.pdf
[4] ftp://ftp-sop.inria.fr/geometrica/fcazals/papers/ncliques.pdf
[5] http://en.wikipedia.org/wiki/Bron%E2%80%93Kerbosch_algorithm
[6] https://developers.facebook.com/docs/guides/canvas/
[7] https://github.com/vladbagrin/Cliques-in-Facebook/
[8] http://apps.facebook.com/sandbox_vladb/

August 28, 2011 / codehost

Anti-patterns

I began studying Computer Science quite late – in my junior year of high school. The first programs were small homework assignments in pseudocode or even organigrams. They were designed to make you understand the concepts of program flow and basic algorithm design. The nice thing about pseudocode is that when you need some complex data structure or instruction – you can just pull one right out of your ass and move on.

Computer Science in that form already became my favorite subject, trampling physics and math forever. Then we met a real programming language – C. C for a noobie is like the recruit training camp in Full Metal Jacket: it’s tough, requires rote learning (of common function calls that we didn’t understand that time), some will lose their hope along the way and the compiler is constantly screaming at you. The good thing about it is you could start writing real and useful applications right away.

Now, there’s a big difference between little programs that implement some algorithm and large, complex applications with specifications that change too often. In high school, my “real world” programs almost always started as proof of concept snippets. They would grow with little hacks here and there, without much control or guidance, until either they became too complex or I would simply lose interest.

The Journey

It became obvious that I couldn’t continue in this manner anymore. The programs needed to be understood by more people, they had to be modular and easily maintainable. It’s easier said than done, because at first it was a challenge to write code that even I could understand after of period of inactivity. I began learning how to manage the design complexity.

The first thing I learned was that I could program like I used to in pseudocode. Instead of starting with int main() and 20.000 lines of code later end with return 0; – break it in appropriately named functions. You can end up with main routines that describe their actions in almost natural language.

Another interesting feature that bugged me at first was commenting the code. I would place comments in parts of code that I found a little less clear – but the words I put in there didn’t help in the long run, it was never explicit enough. Another favorite use for them was to put ironic or humorous remarks or to commemorate battlefields where I fought nasty bugs. I like to think I’ve grown out of this habit, too.

The next step in this uphill battle was discovering classes in C++. In theory (and in my imagination), I could use them to move the code away from the functions and right into the heart of the problem. Instead of telling the computer what to do, I constructed little worlds made up of program parts that run around and do the work. Unfortunately it had the tendency to devolve into procedural code.

General guidelines

My coworkers have this nice custom where every week someone makes a presentation on some technology topic, usually by popular demand. One of the latest presentations was about good coding practices. The guy talking about it would emphasize clarity over non-essential optimization or the amount of cleverness poured in the source code.

He started with how to write comments and how most programmers hate doing this. The funniest reason not to comment is that “The code is obvious” and guess what – he agrees with it. His opinion is that inline comments are not very useful and it’s better to write somewhere at the beginning of a function definition, as clearly as possible, what it does and not how it does it.

He also had some interesting things to say about presenting a simple interface to your code and encapsulating the details. Someone using your functions will be most productive when he sees just simple actions to work with and not be burdened by things that could have been hidden. This approach leads to fewer errors and less headaches.

He also mentioned that whenever you think of implementing something with inheritance and polymorphism – think again, maybe a simple, loose coupled design will be less complex, less prone to mistakes and easier to test – especially when you can define dummy components when the real ones aren’t ready yet.

Patterns

He talked about many other interesting things. The main idea is that people have been stumbling upon these problems since the dawn of programming. There are good solutions and there are bad solutions. The good ones are called Design Patterns – many problems can be redefined to suit a template solution. Sadly, I find reading about them very boring. They are also almost never in touch with what I work with. The bad ones seem to be reinvented all the time, independently, by inexperienced programmers. I discovered that reading about the so called Anti-Patterns is entertaining and I see that I’ve implemented many of them at one point. It’s also a good start to learning design patterns and a nice way of seeing what to avoid. Most anti-patterns have funny names, adding to the irony of using them.

Why?
An anti-pattern is used mainly because of the instant gratification. Your code will start doing it’s job very soon. The downside is that the code becomes over-encumbered over time and sometimes it reaches a point where it’s cheaper to rewrite everything from scratch rather than trying some refactoring on it.

Spaghetti code
This is an old one and describes incomprehensible code with tangled execution flow. It happened a lot in the days of the GOTO statement – especially in assembly. E. W. Dijkstra wrote about the evils of unstructured code in Go To Statement Considered Harmful.
Did I fall for it? In the early days I gave little regard to code structure. I was even tempted to use labels and goto’s more often when I first learned about them in C – I discovered how easy it was to get away from otherwise laborious loops and conditionals. My CS teacher told me this style didn’t conform to the Procedural Paradigm, whatever that was, so I avoided them ever since.

Big ball of mud
This is when the code has no recognizable structure. It looks like bits and pieces of instructions shoddily duct-taped together by a code monkey, patched here and there, with uncontrolled growth – like a jungle.
Did I fall for it? All the time. Only recently did I begin considering careful planning, with pen and paper, before starting to write.

Magic numbers and Hard code
Whenever unexplained constants are encountered. It’s especially detrimental when they’re present in more places, thus whenever you want to change that constant, you have to carefully browse through the entire program.
Did I fall for it? This is a rookie mistake, I guess everyone makes it… right?
Solution Define these in one easy to find place, give them explicit names and avoid using strings, to reduce complexity. The code will also be much clearer this way because, on the subconscious level, you won’t have to think “This is the number of times the loop will execute”, you’ll actually read it out loud (but most likely in your head).

Lava flow
Keeping obsolete and unused code because removing it would take too much time or it’s uncertain what effects it might produce. This is akin to lava that solidifies underneath another layer, impossible to remove or separate.
Did I fall for it? I think I’m quite the opposite of this: whenever I don’t have a use for some instructions or functions, I delete them. The other option is to comment them – *ugh*. This usually comes back to haunt me as most of the time I _will_ want the removed functionality back and, unable to revert to an older version, I end up rewriting it from scratch.

Input kludge
Not checking for bad input. For example when a user introduces invalid data in the program arguments and crashes it. Even with such input handling implemented, thorough testing should be done to ensure nothing was overlooked. An simple type of testing would be the monkey test, where random input is sent for a while.
Did I fall for it? I was conditioned early on to check the validity of data from unpredictable sources. These are user input and data sent over network. I’m still tempted to leave these checks for later more often than I would like.

Gold plating
Continuing to work on a project when the added value would be minimal.
Did I fall for it? It happens quite often indeed. Sometimes I add features that were not even requested, just for fun. In a real work environment I’d better sort out my priorities.

Object orgy
This is when the object data isn’t sufficiently encapsulated. The behavior of the object will not be as predictable or stable and too much is exposed to the users, increasing the complexity. Now this is a funny name, or maybe it’s just me and my twisted sense of humor.
Did I fall for it? Not really. Like with input kludge, I tend to feel when too much is exposed and end up obsessively writing accessor functions. If I need unrestricted access to some hidden data, it’s a sign of bad design, or code smell as they call it.

Premature optimization
Trying to optimize little bits of code early in the development process. This leads to decreased clarity and ends up with compromises to the design. Like D. Knuth said, “Premature optimization is the root of all evil”.
Did I fall for it? I guess I did: instead of designing a better algorithm, I would work on making an efficient suboptimal one.
Compilers these days are very smart and will do this for us. They can even make loop unrolling, arithmetic optimization and don’t call procedures when an in-place (or inline) execution is good enough.

The Blob or The God object
When one object contains most of the application’s functionality. It’s named after the 1958 movie, where a creature grows by consuming (often unsuspecting) humans, like the object that ends up slowly eating away at the program.
Did I fall for it? Unfortunately, this is the latest anti-pattern I’ve fallen victim to. Some of my homework this semester was susceptible to this, for example a “Game board” or “Mail client” class that ended up containing the entire program logic, while the main function just made initializations and launched a loop.
Solution Group the functions in your blob and find their right home – inside other independent objects.

Conclusion

I could go on and on about these anti-patterns, but it’s getting late. This article is way too long already, I don’t think I’ll read it again to check for errors. Ironically, this post is a full blown anti-pattern.
Anyway, I hope you enjoyed it.
For more information, visit this website or if you’re like me, GOTO Wikipedia.
Feel free to post your opinions.

%d bloggers like this: