Don’t you wish you could find a song by humming a few bars? You can’t remember the lyrics, but the tune is stuck in your head. Or maybe it’s an instrumental and there are just no lyrics to search for. Well, there is a website where you can sing a few notes into your microphone and it will search for music that matches what you sang.
I’ve been doing this with the radio station for years. There’s always a late-night DJ you can call who will happily tell you the name of a song just from hearing you hum a few notes over the phone. This is a fantastic service, and computerizing the whole process is definitely a step in the right direction.
I have no idea how Google works, but in my opinion it’s the most useful search engine. Still, it doesn’t allow me to search by sound, which would be a useful addition to its search capabilities. That got me thinking about a few things.
Lately I’ve been using the Lua scripting language. Lua features a data structure called a table, which is a dictionary of keys and their associated values. When you provide a key, you can easily retrieve the associated value. The interesting thing in Lua is that keys can be any data type except nil. When keys are strings it’s like a dictionary mapping. When keys are integers it’s like an array. But keys can be other things like functions, userdata, or even other tables. I have the x and you have the associated y. Tables can do this for you because they are manually fed all the associations. I am the gatekeeper. Are you the keymaster?
What a table can’t do is fuzzy searching, and that is obviously extremely useful. In Lua if you misspell a key name, you get nil instead of your value. But with Google, you can pretty much completely mangle your search text and still be directed to something associated with that text. But why is it limited to text?
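To make the difference concrete, here is a minimal sketch in Python, using a plain dict as a stand-in for a Lua table. The song catalog and the function names are made up for illustration, and the fuzzy part leans on the standard library’s difflib, which is certainly not how Google does it. It just shows that an exact lookup fails on a typo while a similarity-based lookup can still land on the right key.

```python
import difflib

# Hypothetical catalog: exact keys mapped to values, much like a Lua table.
songs = {
    "yesterday": "The Beatles",
    "imagine": "John Lennon",
    "respect": "Aretha Franklin",
}

def exact_lookup(key):
    """Return the value stored under key, or None if the key is absent."""
    return songs.get(key)

def fuzzy_lookup(key, cutoff=0.6):
    """Return the value whose key is most similar to key, or None.

    cutoff is difflib's similarity threshold between 0 and 1; keys that
    score below it are not considered matches.
    """
    matches = difflib.get_close_matches(key, list(songs), n=1, cutoff=cutoff)
    return songs[matches[0]] if matches else None

print(exact_lookup("yesterdya"))  # misspelled key, so nothing is found
print(fuzzy_lookup("yesterdya"))  # closest key is "yesterday"
```

The table still has to be fed every association by hand. The only thing that changes is how forgiving the lookup is about the key.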
The miracle of Google has to be in how it associates data and how it ranks those associations. So Google should really start associating kinds of data other than text. Now it’s true that you can search for images. But you can’t search by images. This reveals a weakness in their association of images. Images are found by the text that appears near them. The association is done by analyzing text annotations, and by virtue of the image’s proximity to other text. No processing is done on the actual image other than measuring its size. You can search for small, medium, or large images. But wouldn’t it be cool if it actually looked at the image and tried to make sense of it? Obviously this is hard. But Google taps into the fact that people do it for them. People caption photos, and stick pictures on web pages where the photos supplement the information on the page.
Let’s go back a step and consider audio again. In the case of midomi.com, it actually does analyze the sound. I’m pretty sure it processes the raw audio and produces some magic numbers. It does the same processing on the audio that you sing into your microphone and when the magic numbers from the song more-or-less match the magic numbers from your singing, you get a match.
So the genius of making such a system work lies in the choice of how to analyze the audio to produce the magic numbers. There is lots of information in an audio waveform, but not all of it helps tell one song from another. And it doesn’t make sense to just remove the inaudible stuff. Obviously the mp3 encoding of the original song and the mp3 encoding of my singing are going to be different. You have to throw away even more information until you are left with only what identifies the thing you are hearing. And you may wish to do this in several domains. Search by melody is probably what you want though. It’s unlikely that you will want to search by loudness or tempo. But if you’re a marching band and you want recordings of music at exactly 90 bpm then go for it.
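Here is a rough Python sketch of what throwing information away could look like for melody. I don’t know what midomi.com really computes, so treat this as a guess at the general idea. It assumes the pitches have already been pulled out of the audio as MIDI note numbers, which is the hard part and is skipped here. The magic numbers are just the semitone steps between neighboring notes, so the key you sing in drops out, and an edit distance forgives a wrong note or two. The tiny catalog and all the function names are invented for the example.

```python
def intervals(pitches):
    """Reduce MIDI note numbers to the semitone steps between neighbors.

    Absolute pitch is discarded, so the same tune sung in a different key
    yields the same list.
    """
    return [b - a for a, b in zip(pitches, pitches[1:])]

def edit_distance(a, b):
    """Levenshtein distance: inserts, deletes, and substitutions cost 1."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute or keep
        prev = cur
    return prev[-1]

def best_match(hummed, catalog):
    """Return the title whose interval pattern is closest to the hummed one."""
    query = intervals(hummed)
    return min(catalog,
               key=lambda title: edit_distance(query, intervals(catalog[title])))

# Toy catalog: opening notes only, as MIDI note numbers.
catalog = {
    "Twinkle Twinkle Little Star": [60, 60, 67, 67, 69, 69, 67],
    "Ode to Joy": [64, 64, 65, 67, 67, 65, 64, 62],
    "Happy Birthday": [55, 55, 57, 55, 60, 59],
}

# Sung a whole step too high, with one sour note near the end.
hummed = [62, 62, 69, 69, 71, 72, 69]
print(best_match(hummed, catalog))
```

Loudness, timbre, and the key are all gone by the time the comparison happens. Searching by tempo instead would mean keeping a different set of numbers, such as the time between note onsets.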
The thing is that it takes effort to analyze, process, classify and associate information. This process is called indexing, and it takes time. But Google doesn’t seem to mind classifying all the text on the internet. We just need to expand that a bit into more types of information.
I would like to be able to search by sound, image, text, or even any binary file. Virus scanners already scan files looking for virus signatures. You could imagine searching by virus.
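Searching by virus is about the simplest version of searching by binary content: look for a known byte pattern inside each file. Here is a small Python sketch of that idea. The signatures are made-up byte strings, not real malware signatures, and the files are just an in-memory dict standing in for an index. Real scanners are far more elaborate than a substring test.

```python
# Hypothetical signatures: invented byte patterns for illustration only.
SIGNATURES = {
    "demo-virus-a": bytes.fromhex("deadbeef"),
    "demo-virus-b": b"EVIL_PAYLOAD",
}

def scan(data):
    """Return the sorted names of every signature that occurs in data."""
    return sorted(name for name, sig in SIGNATURES.items() if sig in data)

def search_by_signature(files, name):
    """Return the sorted names of the files whose bytes contain the signature."""
    sig = SIGNATURES[name]
    return sorted(fname for fname, data in files.items() if sig in data)

# Stand-in for a collection of indexed files: name -> raw bytes.
files = {
    "clean.bin": b"nothing to see here",
    "infected.bin": b"\x00\x01\xde\xad\xbe\xef\x02",
    "also_bad.bin": b"header EVIL_PAYLOAD trailer",
}

print(scan(files["infected.bin"]))
print(search_by_signature(files, "demo-virus-b"))
```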
Google largely makes use of contextual information. If it sees a picture on a page about Abraham Lincoln, then there’s a decent chance the picture is relevant to Abe. It might even be a picture of the guy; chances are, anyway. It just needs to do more analysis and more indexing.
