Sunday, December 6, 2009

Breathalyzer -- Facebook puzzle

More puzzles; in this post I provide the code for another puzzle provided on Facebook. Note that the unit test code is not included here. You can copy the code below after creating a project in an IDE such as Eclipse (which is what I use).

By the way, this puzzle was interesting. I did spend some time thinking of the algorithm to solve the problem but then started Googling for it. Found out about LevenshteinDistance which solves this problem nicely. Have to confess though that my thoughts on solving this problem were focused on using a tree structure. After finding the solution (and also the code for that) I gave up on that approach.

So given below is the problem and the code. The code contains two different functions for getting the Levenshtein Distance. Was having problems with the first function and hence included the second; now have kept both.

By the way see the example below. The word "tihs" has "ties" as the closest although in the context of the example sentence I was not seeing this until I used the program to clear my mental block. An example of computer not having to deal with the "mental block".


PROBLEM: (from Facebook page)

To safeguard against the dreaded phenomenon of wall posting while drunk, Facebook is implementing a feature that detects when post content is too garbled to have been done while sober and informs the user that they need to take an online breathalyzer test before being allowed to post.

Unfortunately, there is far too much content for a given set of persons to evaluate by hand. Fortunately, you are a programmer of some repute and can write a program that processes and evaluates wall posts. You realize that such a program would be of great use to society and intend to resolve the problem once and for all. The program you write must compute a score for a body of text, returning this score upon completion.

Your program will be given a list of accepted words and run on one wall post at a time. For each word W in the post, you must find word W' from the list of accepted words such that the number of changes from W to W' is minimized. It is possible that W is already W' and thus the number of changes necessary is zero. A change is defined as replacing a single letter with another letter, adding a letter in any position, or removing a letter from any position. The total score for the wall post is the minimum number of changes necessary to make all words in the post acceptable.


Input Specification

Your program must take a single string argument, representing the file name containing the wall post to analyze. In addition, your program must open up and read the accepted word list from the following static path location:
/var/tmp/twl06.txt
For testing purposes, you may download and examine the accepted word list here. When submitting your code, you do not need to include this file, as it is already present on the machine.

The input file consists entirely of lower case letters and space characters. You are guaranteed that the input file will start with a lower case letter, and that all words are separated by at least one space character. The file may or may not end with a new line character.

Example input file:
tihs sententcnes iss nout varrry goud
You are guaranteed that your program will run against well formed input files and that the accepted word list is identical to the one provided for testing.


Output Specification

Your program must print out the minimum number of changes necessary to turn all words in the input wall post into accepted words as defined by the word list file. Words may not be joined together, or separated into multiple words. A change in a word is defined as one of the following:
  1. Replacing any single letter with another letter.
  2. Adding a single letter in any position.
  3. Removing any single letter.
This score must be printed out as an integer and followed by a single new line.

Example Output (newline after number):
8




------------------------------------------------------------------------------------------

package com.farooq;

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.ArrayList;

public class Word_distance {

/**
* @param args
*/
private ArrayList legal_word_list = new ArrayList();

public static void main(String[] args) {

String[] sentence_to_check = new String[args.length];
int Breath_analyzer_number; // the result we are looking for

//read the sentence which has to be tested from command line/file etc
if (args.length > 0) {
sentence_to_check = args;
}

//create an instance and determine the Levenshtein distance
Word_distance sentence_LD = new Word_distance();

Breath_analyzer_number = sentence_LD.sentenceAnalysis(sentence_to_check);
System.out.println("The number for the given sentence is " +Breath_analyzer_number);

}

public Word_distance() {
String filename = "C:\\tempfiles\\word_list_Levenshtein.txt";
String line_read;
try {
FileReader word_list_file = new FileReader(filename);
BufferedReader word_buffer = new BufferedReader(word_list_file);

while ((line_read = word_buffer.readLine()) != null) {
//so reading the list of "legal" words into the ArrayList
legal_word_list.add(line_read);
}
word_list_file.close();

} catch (Exception e) {
System.out.println("Exception is " +e);
}
}

public int sentenceAnalysis(String[] sentence) {

// get each word of sentence and determine Levenshtein number for word
//have to iterate the given word over entire legal_word_list
//and determine lowest value; if zero then break, repeat for all words
int total_distance =0;
int[] word_distance = new int[sentence.length] ; //minimum word_distance for each word in the sentence
int temp_distance;


for (int i=0;i< sentence.length; i++) {

for (int j=0; j< legal_word_list.size();j ++) {
temp_distance = getLevenshteinDistance(sentence[i], legal_word_list.get(j).toLowerCase());
// temp_distance = LD(sentence[i].trim(), legal_word_list.get(j).toLowerCase());

if (temp_distance ==0) {
word_distance[i] = temp_distance;
break;
} else if (j==0) {
word_distance[i] = temp_distance;
} else if (word_distance[i] > temp_distance) {
word_distance[i] = temp_distance;
} else {
//DO NOTHING as new word distance is larger than what we had
}
}
total_distance = total_distance + word_distance[i];
}
return total_distance;
}


// the following code copied from http://www.merriampark.com/ldjava.htm by Chas Emerick
public int getLevenshteinDistance (String s, String t) {
if (s == null || t == null) {
throw new IllegalArgumentException("Strings must not be null");
}

/*
The difference between this impl. and the previous is that, rather
than creating and retaining a matrix of size s.length()+1 by t.length()+1,
we maintain two single-dimensional arrays of length s.length()+1. The first, d,
is the 'current working' distance array that maintains the newest distance cost
counts as we iterate through the characters of String s. Each time we increment
the index of String t we are comparing, d is copied to p, the second int[]. Doing so
allows us to retain the previous cost counts as required by the algorithm (taking
the minimum of the cost count to the left, up one, and diagonally up and to the left
of the current cost count being calculated). (Note that the arrays aren't really
copied anymore, just switched...this is clearly much better than cloning an array
or doing a System.arraycopy() each time through the outer loop.)

Effectively, the difference between the two implementations is this one does not
cause an out of memory condition when calculating the LD over two very large strings.
*/

int n = s.length(); // length of s
int m = t.length(); // length of t

if (n == 0) {
return m;
} else if (m == 0) {
return n;
}

int p[] = new int[n+1]; //'previous' cost array, horizontally
int d[] = new int[n+1]; // cost array, horizontally
int _d[]; //placeholder to assist in swapping p and d

// indexes into strings s and t
int i; // iterates through s
int j; // iterates through t

char t_j; // jth character of t

int cost; // cost

for (i = 0; i<=n; i++) {
p[i] = i;
}

for (j = 1; j<=m; j++) {
t_j = t.charAt(j-1);
d[0] = j;

for (i=1; i<=n; i++) {
cost = s.charAt(i-1)==t_j ? 0 : 1;
// minimum of cell to the left+1, to the top+1, diagonally left and up +cost
d[i] = Math.min(Math.min(d[i-1]+1, p[i]+1), p[i-1]+cost);
}

// copy current distance counts to 'previous row' distance counts
_d = p;
p = d;
d = _d;
}

// our last action in the above loop was to switch d and p, so p now
// actually has the most recent cost counts
return p[n];
}


// following code also from web site; not my code
//*****************************
// Compute Levenshtein distance
//*****************************

public int LD (String s, String t) {
int d[][]; // matrix
int n; // length of s
int m; // length of t
int i; // iterates through s
int j; // iterates through t
char s_i; // ith character of s
char t_j; // jth character of t
int cost; // cost

// Step 1

n = s.length ();
m = t.length ();
if (n == 0) {
return m;
}
if (m == 0) {
return n;
}
d = new int[n+1][m+1];

// Step 2

for (i = 0; i <= n; i++) {
d[i][0] = i;
}

for (j = 0; j <= m; j++) {
d[0][j] = j;
}

// Step 3

for (i = 1; i <= n; i++) {

s_i = s.charAt (i - 1);

// Step 4

for (j = 1; j <= m; j++) {

t_j = t.charAt (j - 1);

// Step 5

if (s_i == t_j) {
cost = 0;
}
else {
cost = 1;
}

// Step 6

d[i][j] = Minimum (d[i-1][j]+1, d[i][j-1]+1, d[i-1][j-1] + cost);

}

}

// Step 7

return d[n][m];

}



//****************************
// Get minimum of three values
//****************************

private int Minimum (int a, int b, int c) {
int mi;

mi = a;
if (b < mi) {
mi = b;
}
if (c < mi) {
mi = c;
}
return mi;

}




}

2 comments:

AMedOs said...

i just want to share my c++ solution for this puzzle i have included the method for calculating the distance

/**
* @Author :Ahmed M. Osman
* @CodeJame userName : AMedOs
* @TopCoder Handel : AMedOs
* */
#include
#include
#include
#include

#include
#include
#include
#include
#include
#include
#include
#include

using namespace std;

/**
* start of template
*/
#define pb push_back
#define sz size()
#define loop(i,m) for(ui i=0;i vpii;
typedef vector vs;
typedef vector vvs;
typedef vector vi;
typedef vector > vvi;
typedef long long ll;
typedef unsigned int ui;

vs all;
vs wall;
vi wallcost;
int wordcost;
ui LevenshteinDistance(string s, string t);
ui LevenshteinDistance2(string s, string t);
int main (int argc, char *argv[]){
ifstream tmpin("/var/tmp/twl06.txt");
ifstream fin(argv[1]);
string stmp;

tmpin >> stmp;

while(tmpin){
all.pb(stmp);
tmpin >> stmp;
}

fin >> stmp;
while(fin){
wall.pb(stmp);
fin >> stmp;
wallcost.pb(0);
}

int total=0;
int res=0;
bool found =false;
loop(i,wall.sz){
found =false;
loop(k,i){
if(wall[i].compare(wall[k])==0){
total += wallcost[k];
wallcost[i] = wallcost[k];
found = true;
break;
}
}

if (found) continue;

wordcost = INT_MAX;
loop(j,all.sz){
res = LevenshteinDistance2(wall[i],all[j]);
wordcost = min(wordcost,res);
}
wallcost[i] = wordcost;
total += wordcost;
}
cout << total <<endl;

return 0;
}

ui LevenshteinDistance(string s, string t){
ui m = s.sz;
ui n = t.sz;

// d is a table with m+1 rows and n+1 columns
ui d[m+1][n+1];

loop(i,m+1)d[i][0] = i; // deletion

loop(j,n+1)d[0][j] = j; // insertion

loop4(i,1,m+1){
loop4(j,1,n+1){
if (tolower(s[i-1]) == tolower(t[j-1])){
//if (s[i-1] == t[j-1]){
d[i][j] = d[i-1][j-1];
}
else{
d[i][j] = min(
d[i-1][j] + 1,min( // deletion
d[i][j-1] + 1, // insertion
d[i-1][j-1] + 1 // substitution
));
}
}
}
return d[m][n];
}
ui LevenshteinDistance2(string s, string t){
ui n = s.sz; // length of s
ui m = t.sz; // length of t

if (n == 0) {
return m;
} else if (m == 0) {
return n;
}

int *p = new int[n+1]; //'previous' cost array, horizontally
int *d = new int[n+1]; // cost array, horizontally
int *_d; //placeholder to assist in swapping p and d

// indexes into strings s and t
ui i; // iterates through s
ui j; // iterates through t

char t_j; // jth character of t

int cost; // cost

for (i = 0; i<=n; i++) {
p[i] = i;
}

for (j = 1; j<=m; j++) {
t_j = tolower(t[j-1]);
d[0] = j;

for (i=1; i<=n; i++) {
cost = tolower(s[i-1])==t_j ? 0 : 1;
// minimum of cell to the left+1, to the top+1, diagonally left and up +cost
d[i] = min(min(d[i-1]+1, p[i]+1), p[i-1]+cost);
}

// copy current distance counts to 'previous row' distance counts
_d = p;
p = d;
d = _d;
}

// our last action in the above loop was to switch d and p, so p now
// actually has the most recent cost counts
return p[n];
}

AMedOs said...

the missing includes is
#include < stdio.h >
#include < stdlib.h >
#include < string.h >
#include < math.h >

#include < iostream >
#include < vector >
#include < fstream >
#include < sstream >
#include < map >
#include < string >
#include < algorithm >
#include < cmath >
#include < climits >