Matt Canty's Writing Place

Text Message Analysis

August 25, 2011

Idea

My phone was getting cluttered, which slows down loading times and general system speed, so I decided to delete all my messages. I didn’t want to lose them forever, though, so I wanted to store them somewhere first. That gave me the idea: analyse the text to find the most common words. It would make good practice and should have a fun output. Here goes.

To begin with…

I did some research on text-message backup applications for Android and came across SMS Backup. I wouldn’t recommend this app, and I will explain why further down. Once the app was installed, I ran it and chose the option to back up SMS from the main menu. It gave the file a stupid name that was hard to remember, but I saw it was a .bak file, so it would be easy to find. I plugged the phone into the laptop and searched the memory card for *.bak, and there it was. On opening it in Notepad I could see it looked a right mess.

nova_vers1-0-1,1314033236680,INBOX,2929,ENVIATS,2732,BORRADOR,0,FAILED,1,INBOX_ini,0#1a4)1#1a4)2#1a4)2#1a4)+447754******#1a4)11-24-2010 #1a4)null#1a4)1290642384870#1a4)Jane Dodson#k1ah4dc#Long film! I love you x#k1ah4dc#0#1a4)1#1a4)4#1a4)20#1a4)+447814******#1a4)11-25-2010 #1a4)null#1a4)1290676930055#1a4)Mum#k1ah4dc#Milk needed if you home before 5#k1ah4dc#0#1a4)1#1a4)5#1a4)21#1a4)My 3#1a4)11-25-2010 #1a4)null#1a4)1290678961827#1a4)My 3#k1ah4dc#Thanks for registering with My 3. Your web password is: ****** but you can change this by going to three.co.uk/my3 and clicking on My security.#k1ah4dc#0#1a4)1#1a4)6#1a4)21#1a4)My 3#1a4)11-25-2010 #1a4)null#1a4)1290678964013#1a4)My 3#k1ah4dc#Thanks for registering with My 3. Your web password is: ***** but you can change this by going to three.co.uk/my3 and clicking on My security.#k1ah4dc#0#1a4)1#1a4)7#1a4)16#1a4)+44785******2#1a4)11-25-2010 #1a4)null#1a4)1290702******#1a4)Kerri Louise-owens#k1ah4dc#Hey you. What are your plans for this weekend? Dog goes on nights on Sunday so we need to get out his weekend! x#k1ah4dc#0

Before writing any code, I needed to understand the file’s format. I just put new lines in at various points until it looked good, like this…

nova_vers1-0-1,1314033236680,INBOX,2929,ENVIATS,2732,BORRADOR,0,FAILED,1,INBOX_ini,0
#1a4)1
#1a4)2
#1a4)2
#1a4)+447754******
#1a4)11-24-2010
#1a4)null
#1a4)1290642384870
#1a4)Jane Dodson#k1ah4dc#Long film! I love you x#k1ah4dc#0
#1a4)1
#1a4)4
#1a4)20
#1a4)+447814******
#1a4)11-25-2010
#1a4)null
#1a4)1290676930055
#1a4)Mum#k1ah4dc#Milk needed if you home before 5#k1ah4dc#0

EDIT – Before I go on: at about this point I thought, this app sucks, I want to go back and try something else. So I opened the app, clicked restore messages and got ERROR – no backup file exists! This is not true, as I did not remove the file; I copied it. Also, when I first ran it the app managed to show me the data on the phone, and now it won’t. All I did was unplug the phone and plug it back in again.

Laid out like that, we can start to give each line a title. The first half of the file is the INBOX, and roughly halfway through the file – at INBOX_fi,ENVIATS_ini – it becomes the OUTBOX. Shown above are the first two messages this phone received. I’m not sure what the 1 or the null refer to, as they stay the same throughout the file. My suspicion is that the null could indicate whether the message contains media, as the application left all MMS messages on the phone.

In order – not including rows 1 and 6 (the 1 and the null) – they are as follows:

Message Number, Message Thread, From/To Number, Date, Timestamp, Message Content.
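The long number in the Timestamp row looks like a Unix time in milliseconds, which would agree with the Date row sitting next to it. That reading is an assumption on my part, not something the app documents. A minimal sketch to check it, using a made-up helper name:

```php
<?php
// Hypothetical helper: treats the backup's timestamp field as
// milliseconds since the Unix epoch (an assumption, not documented).
function backupTimestampToDate(string $millis): string
{
    // Drop the millisecond part to get whole seconds.
    $seconds = (int) floor((float) $millis / 1000);
    // Format in UTC so the result does not depend on the server timezone.
    return gmdate('Y-m-d H:i:s', $seconds);
}

// First sample message from the backup above.
echo backupTimestampToDate('1290642384870'), "\n";
```

For the first sample message this lands on 24 November 2010, the same day as its 11-24-2010 date field, so the guess at least holds for that row.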

So now I wanted to convert this jumbled up data into something I am happier to see…

From .Bak to MySQL with PHP

To figure this one out I just used a bit of paper and wrote down what I needed to do. At this point I should point out that I made two copies of the .bak file, one called INBOX.bak and the other OUTBOX.bak; this way I can assign a suitable IN/OUT tag before entering it all into the database. I also removed the top line.
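The two files can also be produced in code rather than by hand, assuming each one should hold only its own half of the messages. This is a rough sketch: the marker string is the one quoted earlier, but exactly how it sits in a real backup file is an assumption, and splitBackup is a made-up name.

```php
<?php
// Sketch only: split the raw backup text at the inbox/outbox boundary.
// The default marker comes from the post; its exact surroundings in a
// real file are an assumption.
function splitBackup(string $raw, string $marker = 'INBOX_fi,ENVIATS_ini'): array
{
    $pos = strpos($raw, $marker);
    if ($pos === false) {
        // No marker found: treat everything as inbox.
        return ['inbox' => $raw, 'outbox' => ''];
    }
    return [
        'inbox'  => (string) substr($raw, 0, $pos),
        'outbox' => (string) substr($raw, $pos + strlen($marker)),
    ];
}

// Possible usage (the source file name is hypothetical):
// $parts = splitBackup(file_get_contents('backup.bak'));
// file_put_contents('INBOX.bak', $parts['inbox']);
// file_put_contents('OUTBOX.bak', $parts['outbox']);
```

The top line of the file would still need removing from the inbox half, as described above.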

1. Prepare database

Created a database in the usual way, called the table texts and gave each of the columns suitable types to handle the data correctly. This took some trial and error.

2. Load the file

$textData = file_get_contents("INBOX.bak");

3. Split the data at each “#1a4)”. This leaves an array of message parts.

$exData = explode("#1a4)", $textData);

4. Go through the $exData array and prepare for MySQL INSERT query

To do this I used a foreach loop with $exData as $part. Within the loop a switch statement acts on whether it is on line 1, 2, 3, 4, 5, 6, 7 or 8 – for example:

case 4: $date = $part; break;

When $i gets to 8, I reset $i to 0 and perform the query. The query was as follows:

mysql_query("INSERT INTO texts VALUES ('','1','$msgNum','$thread','$fromTo','$date','in','$time','$message')");

Before I did this I did various tests using echo, just to check the data was being organised correctly and everything was in the right place – notice I have written “in” in the 7th input. The first input is the PRIMARY KEY which is automatically generated by the database.

5. Run the script

When the PHP file is ready, don’t worry about hitting refresh again and again, as the database will automatically drop duplicates – although you may want to be careful, as I am using a local database. Now I have a database with all my messages, and if I sort by the message number column, I can see it all in order.
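Steps 3 and 4 can be sketched without the switch statement by cutting the exploded array into groups of eight. This is not the code I ran, just one possible shape for it. The field names follow the variables in the query above, while unknown1 and unknown6 are placeholders for the two rows I could not identify and parseRecords is a made-up name.

```php
<?php
// Field order as described in the post. 'unknown1' and 'unknown6' are
// placeholder names for the two unidentified rows (the 1 and the null).
const FIELDS = ['unknown1', 'msgNum', 'thread', 'fromTo', 'date', 'unknown6', 'time', 'message'];

function parseRecords(string $textData): array
{
    $parts = explode('#1a4)', $textData);
    // The text starts with a separator, so the first element is empty.
    if ($parts !== [] && $parts[0] === '') {
        array_shift($parts);
    }
    $records = [];
    foreach (array_chunk($parts, count(FIELDS)) as $chunk) {
        // Skip a trailing partial record, if there is one.
        if (count($chunk) === count(FIELDS)) {
            // trim() removes stray spaces, such as the one after the date.
            $records[] = array_combine(FIELDS, array_map('trim', $chunk));
        }
    }
    return $records;
}
```

Each record is then an array like ['msgNum' => '2', 'thread' => '2', ...] that can be passed to the INSERT. Message text containing an apostrophe will break a query built by dropping variables straight into the string, so a prepared statement is the safer way to do the insert.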

Cleaning the Data

Before doing any analysis, it’s probably best to get rid of the strange characters in the messages. I just selected the message and ID columns from the whole database, did a string replace and updated the database. Like:

$string = str_replace("#k1ah4dc#", "", $string);

That’s replacing #k1ah4dc# with “” – nowt.
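An alternative to deleting the separator is to split on it, which keeps the contact name apart from the message text. This is a sketch of that different approach, not what I did. Reading the three pieces as name, text and a trailing flag is an assumption based on the sample rows, and splitContent is a made-up name.

```php
<?php
// Alternative to str_replace: split the content field on the inner
// separator instead of removing it. The meaning of the three pieces
// (contact name, message text, trailing value) is an assumption.
function splitContent(string $field): array
{
    // Limit of 3 keeps any further separators inside the last piece.
    $pieces = explode('#k1ah4dc#', $field, 3);
    // Pad so that short or malformed fields still give three keys.
    $pieces = array_pad($pieces, 3, '');
    return ['name' => $pieces[0], 'text' => $pieces[1], 'tail' => $pieces[2]];
}

print_r(splitContent('Mum#k1ah4dc#Milk needed if you home before 5#k1ah4dc#0'));
```

Only the text piece would then go forward into the word count, so contact names and the trailing 0 do not end up glued onto the first and last words of each message.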

Analysis

Here I selected all the messages from the database and concatenated them into a single string, $bigString. This string is then exploded on spaces into an array of words called $text. Each word that is over 4 letters long, and not in the “excluded” array, is placed into an array called $words. The counting is done with $counts = array_count_values($words); followed by array_multisort($counts, SORT_DESC); to put the highest counts first.

Once done, the $counts array contains each word and its associated count. Loop through the top 20 and echo out a bit of text to make it easy to read.

$i = 1;
// Print the 20 most common words with their counts.
foreach ($counts as $key => $value) {
    if ($i <= 20) {
        echo 'Number of occurrences: ' . $value . ' of word: ' . $key . "\n";
        $i++;
    }
}
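The whole counting step can be wrapped in one function. This is a minimal sketch, not the exact code I ran: topWords is a made-up name, $minLength and $excluded stand in for the “over N letters” rule and the excluded array, and the lowercasing, punctuation trimming and use of arsort are additions of the sketch.

```php
<?php
// Sketch of the counting step. Lowercasing and punctuation trimming are
// extras that the original space-only split did not do.
function topWords(array $messages, int $minLength, array $excluded, int $limit = 20): array
{
    $bigString = strtolower(implode(' ', $messages));
    $words = [];
    foreach (preg_split('/\s+/', $bigString, -1, PREG_SPLIT_NO_EMPTY) as $word) {
        $word = trim($word, ".,!?:;\"'()");   // strip surrounding punctuation
        if (strlen($word) > $minLength && !in_array($word, $excluded, true)) {
            $words[] = $word;
        }
    }
    $counts = array_count_values($words);     // word => number of occurrences
    arsort($counts);                          // highest count first, keys kept
    return array_slice($counts, 0, $limit, true);
}

foreach (topWords(['See you tomorrow', 'Tomorrow then!'], 4, ['about']) as $word => $count) {
    echo 'Count: ' . $count . ' Word: ' . $word . "\n";
}
```

Calling it with 3, 4, 5 and 6 as $minLength would give the four lists in the results.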

Results!

The results are below, I’m not going to put them online. I’m not going to do anything with the results either. I just enjoyed getting them. Probably says something about me – both the results and enjoying the process. Next I plan to recreate the Android messaging layout so I can continue to access all my old messages :D

Over 4 Letters

Count: 434 Word: about
Count: 278 Word: there
Count: 252 Word: think
Count: 204 Word: going
Count: 184 Word: today
Count: 180 Word: would
Count: 176 Word: could
Count: 174 Word: where
Count: 164 Word: still
Count: 154 Word: tomorrow
Count: 154 Word: should
Count: 154 Word: thats
Count: 152 Word: tonight
Count: 144 Word: coming
Count: 140 Word: night
Count: 120 Word: though
Count: 110 Word: right
Count: 108 Word: doing
Count: 102 Word: round
Count: 100 Word: thanks

Over 6 Letters

Count: 154 Word: tomorrow
Count: 152 Word: tonight
Count: 94 Word: evening
Count: 84 Word: getting
Count: 76 Word: watching
Count: 74 Word: thinking
Count: 72 Word: morning
Count: 70 Word: something
Count: 56 Word: facebook
Count: 54 Word: probably
Count: 52 Word: looking
Count: 52 Word: birthday
Count: 52 Word: christmas
Count: 50 Word: message
Count: 46 Word: anything
Count: 44 Word: thought
Count: 44 Word: lancaster
Count: 44 Word: heading
Count: 40 Word: finished
Count: 38 Word: everyone

Over 3 Letters

Count: 944 Word: love
Count: 638 Word: your
Count: 628 Word: just
Count: 530 Word: that
Count: 530 Word: have
Count: 506 Word: what
Count: 502 Word: will
Count: 434 Word: about
Count: 424 Word: baby
Count: 422 Word: with
Count: 370 Word: come
Count: 364 Word: woof
Count: 316 Word: want
Count: 312 Word: good
Count: 300 Word: when
Count: 294 Word: home
Count: 282 Word: yeah
Count: 278 Word: there
Count: 274 Word: this
Count: 272 Word: time

Over 5 Letters

Count: 154 Word: tomorrow
Count: 154 Word: should
Count: 152 Word: tonight
Count: 144 Word: coming
Count: 120 Word: though
Count: 100 Word: thanks
Count: 94 Word: evening
Count: 94 Word: really
Count: 94 Word: sounds
Count: 84 Word: getting
Count: 76 Word: please
Count: 76 Word: watching
Count: 74 Word: thinking
Count: 72 Word: morning
Count: 70 Word: something
Count: 62 Word: having
Count: 58 Word: lovely
Count: 56 Word: campus
Count: 56 Word: london
Count: 56 Word: facebook

Questions and comments below!!!


I am Matt Canty, an engineer of varying titles who helps tech companies run reliable and fault-tolerant software platforms. Just send me an email: matthewcanty@gmail.com.
