Morphological Operations – Music Reader

August 24, 2012

We now apply morphological operations to build a program that sings. Scilab can produce tones of a given pitch with the function sound() by varying the frequency and duration of the input signal. Combining this with morphological operations, we will try to have the program read an image of a musical score and sing it while minimizing the need for human input.

I obtained a JPG image of the sheet music for Jingle Bells from the internet. We invert the image so that the notation becomes white on black, which makes the processing simpler.
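
As a minimal sketch of this inversion step (in Python rather than the Scilab used for the actual program), the function below inverts an 8-bit grayscale image and thresholds it to a binary image. The function name and the threshold of 128 are assumptions made for illustration, not values from my code.

```python
def invert_and_binarize(gray, threshold=128):
    """Invert an 8-bit grayscale image (list of rows) and threshold it to 0/1.

    Dark ink on white paper becomes 1 (white) on 0 (black).
    The threshold of 128 is an assumed value.
    """
    return [[1 if (255 - pixel) >= threshold else 0 for pixel in row]
            for row in gray]


# Tiny example: 255 is white paper, low values are ink.
page = [[255, 0],
        [200, 30]]
binary = invert_and_binarize(page)
```

After this step the notes and staff lines are the foreground (1) pixels, which is the form that erosion expects.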

Our two concerns are as follows: identifying the location of the notes in relation to the staff, and recognizing the type of note. The position of the notes will dictate the tone of that particular note, while the type of note recognized will determine how long the note will be sung.

In order to identify the notes, we need to sample the music sheet itself. However, the notes need to be retrieved independently of the staff, so it is not a simple case of cropping the image. To retrieve the images of the notes, we erode the image with a vertical structuring element that is tall enough to remove the thin horizontal staff lines. We then crop an image of each type of note from the eroded result, and do the same for the rests that appear. Every type of note should be sampled, and if a note type is not drawn consistently throughout the image, more samples of it need to be taken.
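
The staff-removal step can be sketched as follows. This is an illustrative Python version, not my Scilab code: the `erode` function, its top-left origin convention, and the toy image are all assumptions chosen to keep the example small.

```python
def erode(image, selem):
    """Binary erosion of a 0/1 image (list of rows) by a 0/1 structuring element.

    The origin of the structuring element is assumed to be its top-left cell.
    A pixel is kept only if every 1 of the element lands on a 1 of the image;
    positions that fall outside the image count as 0.
    """
    rows, cols = len(image), len(image[0])
    offsets = [(dr, dc)
               for dr, row in enumerate(selem)
               for dc, value in enumerate(row) if value]
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if all(r + dr < rows and c + dc < cols and image[r + dr][c + dc]
                   for dr, dc in offsets):
                out[r][c] = 1
    return out


# Toy image: a one-pixel-thick staff line (row 2) crossing a 3x2 note blob.
image = [
    [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
vertical = [[1], [1], [1]]  # 3 pixels tall, 1 pixel wide
no_staff = erode(image, vertical)
```

The staff line is only one pixel tall, so it cannot contain the three-pixel vertical element and disappears, while the taller note blob leaves a (shrunken) trace that can be cropped as a sample.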

Now that we have reference images of what the notes look like in that particular music sheet, we can proceed to find where each note type occurs in the image. We do this by eroding the image using each sampled note as the structuring element. However, we do encounter a small hiccup. The half note, when used as a structuring element, also picks up the locations of the quarter notes and the eighth notes, and the quarter note also picks up the eighth notes. We remedy this by performing a form of NAND operation between each note's result and the results of the notes it overlaps with (strictly speaking, an AND-NOT). That is, should a location found with the half note coincide with a location found with the quarter note, that location is deleted from the half-note result. However, the two points may also be separated by some distance because the structuring elements differ in shape. In that case the locations do not overlap exactly and the NAND alone is useless. We handle this by scanning the immediate surroundings of each point and treating points that are too close as coinciding.

Locations of half notes (left) and quarter notes (right). Points have been enhanced for visual purposes.
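
The overlap removal with its small neighborhood scan can be sketched like this. It is a simplified Python illustration working on lists of hit coordinates; the function name, the coordinates, and the default radius of 2 pixels are hypothetical and not taken from my program.

```python
def remove_overlaps(candidates, confirmed, radius=2):
    """Drop every candidate point that has a confirmed point nearby.

    candidates: (row, col) hits from the less specific template (e.g. half note)
    confirmed:  (row, col) hits from the more specific template (e.g. quarter note)
    radius:     half-width of the square neighborhood that is scanned (assumed)
    """
    return [
        (r, c) for (r, c) in candidates
        if not any(abs(r - pr) <= radius and abs(c - pc) <= radius
                   for (pr, pc) in confirmed)
    ]


# Made-up hit locations: the half-note template also fires on quarter notes.
half_hits = [(10, 5), (10, 40), (11, 81)]
quarter_hits = [(10, 40), (10, 80)]
true_half = remove_overlaps(half_hits, quarter_hits)
```

With a radius of 0 this reduces to the plain AND-NOT, which would miss the hit at (11, 81) because it is one pixel away from the quarter-note hit at (10, 80).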

Another thing to note is that since we are using the eroded images of the notes, it's possible for two adjacent points to appear for a single note. We simply pass the resulting image through another erosion (with a conditional to ignore pixels that are already isolated) and we are left with single points.
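
For illustration, the same goal of one point per note can also be reached by grouping touching points and keeping one representative per group. Note that this is a different technique from the conditional erosion described above, swapped in because it is short to write; the function name and the sample points are assumptions.

```python
def collapse_clusters(points):
    """Keep one point per group of 8-connected points.

    The representative is the first point of each group in sorted order
    (top-most row, then left-most column).
    """
    remaining = set(points)
    singles = []
    for seed in sorted(points):
        if seed not in remaining:
            continue  # already absorbed into an earlier group
        singles.append(seed)
        remaining.discard(seed)
        stack = [seed]
        while stack:  # flood fill over the 8 neighbors
            r, c = stack.pop()
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    neighbor = (r + dr, c + dc)
                    if neighbor in remaining:
                        remaining.discard(neighbor)
                        stack.append(neighbor)
    return singles


hits = [(5, 5), (5, 6), (6, 6), (20, 3)]
single_points = collapse_clusters(hits)
```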

At this point, we have identified the different notes in the image and have localized them to certain points in the image. Now we need to figure out a way of correlating these locations to the location of the staff. We start by localizing the staff. Similar to how we located the notes, we just sample the staff and erode the entire image with it. The result should be points indicating where the staff is.

A little calibration is needed at this point, particularly in my case since the erosion function I have been using is my own code. By comparing the location of a staff point with the location of a note point, and referring to the tone that the note is known to play, we obtain a basis for calibrating the different tones. Once this is done, the program simply measures the distance from each note point to the staff point and assigns the appropriate frequency. It is best to calibrate the program separately for each sampled note, because differences in the size and appearance of the samples can change where the localizing point appears. The rests do not need to be calibrated to the staff, as we simply assign them a frequency below the audible range.
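
The distance-to-frequency mapping could look roughly like the sketch below. It rests on several assumptions that are not part of my program: a treble clef whose bottom line is E4, no accidentals (C major), equal temperament with A4 = 440 Hz, and a hypothetical per-sample `calibration_offset` in pixels.

```python
# Semitone offsets of the C major scale within one octave: C D E F G A B
MAJOR_SCALE = [0, 2, 4, 5, 7, 9, 11]


def step_to_frequency(steps_above_e4):
    """Convert diatonic steps above E4 (assumed bottom staff line) to Hz."""
    index = 2 + steps_above_e4               # E is index 2 of the C major scale
    octave, degree = divmod(index, 7)        # divmod also handles notes below C4
    semitones_from_c4 = 12 * octave + MAJOR_SCALE[degree]
    # C4 lies 9 semitones below A4 = 440 Hz in equal temperament.
    return 440.0 * 2 ** ((semitones_from_c4 - 9) / 12)


def note_frequency(note_row, staff_row, line_spacing, calibration_offset=0):
    """Frequency of a note from its vertical distance to the bottom staff line.

    Rows grow downward, so a note above the line has a smaller row number.
    Each half of the staff-line spacing corresponds to one diatonic step.
    calibration_offset (pixels) absorbs where a given sample's point lands.
    """
    half_spacing = line_spacing / 2
    steps = round((staff_row - note_row + calibration_offset) / half_spacing)
    return step_to_frequency(steps)


# A note 15 pixels above the bottom line, with 10 pixels between staff lines,
# sits three steps up (E, F, G, A), which is A4.
frequency = note_frequency(note_row=85, staff_row=100, line_spacing=10)
```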

After this, we scan the image in relation to the staff position and build a sound matrix, following the notes. The result is these sound files (separated due to size restrictions in Scilab).
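
Building the sound data from a list of recognized notes can be sketched as below. This is a Python stand-in for the Scilab sound matrix: the sample rate, the beat length, the 1 Hz rest frequency, and the short test melody are all assumed values for illustration.

```python
import math


def tone(frequency, seconds, sample_rate=8000):
    """Sine-wave samples for one note (sample_rate is an assumed value)."""
    count = int(round(seconds * sample_rate))
    return [math.sin(2 * math.pi * frequency * i / sample_rate)
            for i in range(count)]


def build_song(notes, beat_seconds=0.5, sample_rate=8000):
    """Concatenate the tones of (frequency_hz, beats) pairs into one signal."""
    samples = []
    for frequency, beats in notes:
        samples.extend(tone(frequency, beats * beat_seconds, sample_rate))
    return samples


REST = 1.0  # Hz, well below the audible range, so a rest is effectively silent

# Illustrative input: quarter, quarter, half note on E4, then a quarter rest.
melody = [(329.63, 1), (329.63, 1), (329.63, 2), (REST, 1)]
song = build_song(melody)
```

The note type only enters through the number of beats, and the note position only through the frequency, which mirrors the two concerns listed at the start.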

https://www.yousendit.com/download/TEhYRkJVNkdrWTlMWE1UQw

https://www.yousendit.com/download/TEhYRkJVNkdveE1UWThUQw

https://www.yousendit.com/download/TEhYRkJVNkdvQUxxYk1UQw

The resulting music is mostly correct, with only a few off-key notes. Apart from those, the program was successful in reading the music sheet and playing the corresponding music.

I should note here that I only had access to the SIVP toolbox of Scilab. As such, the methods used to reach these goals may seem a bit roundabout compared to other approaches. Still, despite the crude approach and over 800 lines of code, the method is viable. Human input was only needed for sampling the images and calibrating the notes. Other than this, the program is hands-free.

One huge drawback of this program is the time needed to run it. The slowest part of the code is localizing the notes, which takes about 50 minutes for each sampled note pair (upright and upside-down notes) on a 789×959 image, even though the code was already optimized to search only in the immediate area of the staff. In this case, I had 5 note pairs and the rests. Running the program from the point where the notes have already been sampled takes well over 4 hours. This can be improved by reducing the number of sampled notes. As mentioned earlier, there are cases where notes of the same type do not resemble each other exactly, which is what forces the extra samples. A possible remedy would be to modify the erosion so that instead of a binary result it outputs a grayscale result, with the value reflecting the similarity between the structuring element and that location on the image.
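
A minimal sketch of such a grayscale erosion is given below, assuming the similarity is measured as the fraction of structuring-element pixels that land on foreground pixels. I have not implemented this in the actual program; the function name, the top-left origin, and the scoring rule are assumptions.

```python
def soft_erode(image, selem):
    """Erosion variant that returns a match score in [0, 1] per pixel.

    The score is the fraction of the structuring element's 1-pixels that land
    on 1-pixels of the image (origin assumed at the element's top-left cell;
    positions outside the image count as misses). A score of 1.0 is exactly
    where ordinary binary erosion would output 1.
    """
    rows, cols = len(image), len(image[0])
    offsets = [(dr, dc)
               for dr, row in enumerate(selem)
               for dc, value in enumerate(row) if value]
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            matched = sum(1 for dr, dc in offsets
                          if r + dr < rows and c + dc < cols
                          and image[r + dr][c + dc])
            out[r][c] = matched / len(offsets)
    return out


# A 2x2 note head with one pixel missing still scores 0.75 against a full
# 2x2 sample, whereas binary erosion would reject it outright.
image = [[1, 1, 0],
         [1, 0, 0],
         [0, 0, 0]]
sample = [[1, 1],
          [1, 1]]
scores = soft_erode(image, sample)
```

Thresholding the scores (say, keeping values above some chosen cutoff) would then let one sample match slightly different drawings of the same note.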

For this activity, since I was a little late in posting this, I will be giving myself 9/10.
