Pasta dataset

Ever been curious about the distribution of letters and other symbols in alphabet pasta?

alphabet pasta

Here comes the pasta dataset: three different kinds of alphabet pasta, thousands of raw photos of pasta pieces on a blue background, and gray-scale images of more than 20000 single pieces of pasta extracted from the raw images.

pieces of alphabet pasta extracted from the cam image

The dataset is intended for training students on semi-supervised learning (clustering plus manual labeling of clusters) and other machine learning techniques with real-world data. An interesting task is to count all pieces of pasta and get relative frequencies of all letters.

Photos have been taken automatically by a LEGO machine (see below for details and video).

Download

ZIP file with PNG and JPG images (1.1 GiB): pasta.zip

The ZIP file contains six folders named raw_xyz and noodles_xyz, where xyz is one of albhof, gaggli, riesa (three different products).

License

Raw and processed images are licensed under the Creative Commons Attribution-ShareAlike 4.0 International License (summary). The dataset as a whole is licensed under the Open Data Commons Open Database License (ODbL) v1.0 (summary).

Size and contents

The dataset contains images of 750g alphabet pasta, 250g per type/product. Products are labeled albhof, gaggli, riesa.

‘albhof’ and ‘riesa’ images look very similar (letters A-Z), whereas ‘gaggli’ is different (A-Z, 0-9, some special symbols). Details on image formats and file name conventions are provided below.

The dataset contains more than 3300 raw cam images, which show more than 23000 pieces of pasta depicted in almost 22000 processed images.

Recording technique

A LEGO machine based on a LEGO Spike Prime set and corresponding expansion set takes some pieces of pasta from a storage container and places them under a usual webcam. The cam takes an image, which then gets analyzed for quality.

If the image is okay, that is, if at least half the pasta pixels belong to isolated pieces of pasta (not too many conglomerations), pieces get removed from the machine. If the image is not okay, all pasta pieces are put into a separate container and the image is dropped. Dropped paste pieces are put back into the storage container from time to time. Thus, each piece of pasta finally will be shown on some image of sufficient quality.

Here is a video showing the machine in action:

I do not publish the algorithm for analyzing the images here, because thinking about the problem and developing an algorithm is a regular task for my students. If you are interested in the details, please contact me via email.

Raw camera images

The dataset contains raw (that is, unprocessed) images from the webcam (1280x720 JPG file). Files are named cam_TIMESTAMP_noodle.jpg, where TIMESTAMP is the recording time (Unix time in nanoseconds).

raw image from camera

Each 200 or fewer images there is an image without any pasta pieces, named cam_TIMESTAMP_empty.jpg. Those images may be used for color calibration, correcting small alterations of cam position and lighting, and so on.

The blue background got dirtier and dirtier during the recordings, which took about 20 hours in total. From time to time I cleaned it with mediocre success. Pasta-less images document the plate’s condition.

Images of single pasta pieces

Next to raw images the datasets contains images showing only one single piece of pasta. The extraction process uses simple image processing techniques (no machine learning stuff). I do not publish the details here (else my students would sneak peaks at this webpage). If you’re interested, please contact me via email.

The images are stored as gray-scale PNG files of different size (as small as possible). Color range of each image is scale to the maximum, that is, there’s always a black and a white pixel. File names have the form SUBSET_TIMESTAMP_INDEX.png. Here, TIMESTAMP is the time stamp of the raw cam image from which the pasta piece has been extracted. INDEX is a consecutive number denoting the different pieces of pasta in the cam image. SUBSET is one of cut, noodle, glued and describes the quality of the image.

Files with ‘cut’ contain images of pasta pieces lying at the cam image’s border. Such pieces haven’t been recorded fully.

cut piece of pasta

Files with ’noodle’ are the good ones: one piece of pasta per file. Images have been rotated to have all images showing the same letter in identical orientation. That’s not necessarily the correct orientation. But it’s a consitent one (up to some exceptions, where there are two different orientations for a letter).

good image of pasta piece

Files with ‘glued’ show more than one piece of pasta. Whether a file is labeled ’noodle’ or ‘glued’ solely depends on the height of the image. Thus there might be misslabelings, especially for the ‘gaggli’ type of pasta, which contains very differently sized pieces of pasta.

glued pieces of pasta

Log files

Each folder with processed images contains a log file created during processing. For each raw cam image there are two lines (except for the empty calibration images, which only have the file name line) showing the raw image’s file name (line 1) and information on image quality (line 2). Image quality is described by two numbers, labeled ‘okay’ and ‘dropped’. The ‘okay’ value is the number of pasta pixels belonging the an isolated piece of pasta (resulting in a noodle_....png file). The ‘dropped’ value is the number of pasta pixels belonging to pieces of pasta shown in cut....png and glued....png files.

The last two lines of each log file show the total number of good and dropped pasta pixels. From those values and the number of noodle....png files one can estimate the total number of pasta pieces in the dataset.