PowerBI

I think when most of us have some data to sort or make sense of, we tend to gravitate toward a spreadsheet. Using Excel or LibreOffice, or if you really like to party, OpenRefine. There are plenty of meme’s out there representing the frustration people have with bugs, features and limitations of Excel specifically.

There are more tools out there for making sense of data, one some people have access to is Microsoft’s more advanced PowerBI tool. Marketed as a Data Visualization tool it is accessible to many with a Office 365 subscription. It offers expanded features than excel and isn’t as limited in row maximums.

PowerBi was recently the topic of a Code4Lib editorial issue. The writer of an article for their journal posted two PowerBI datasets which a reader later noticed had private data. After some miscommunications and misunderstandings an open letter was drafted and received some support. Code4Lib did release a statement and lessons were learned.

One statement from the Code4Lib staff caught my eye. “The released files were in a proprietary file format, Microsoft Power BI, with which none of the editors have experience.”

We all use tools for our jobs we are most familiar or available to us. No one can be an expert in all file formats. Some us try, but things change so fast it is impossible. But, we can do more in documenting and making formats identifiable through the tools we use for digital preservation. The File Format Wiki and PRONOM have had no mention of Power BI, so let’s change that.

Microsoft Power BI was released in 2011 and has been part of the Microsoft Power Platform. Power BI can gather data from many sources. The software can be accessed in the Office 365 cloud, but also using a Desktop application. In the desktop application, all the data sources and connections are stored in a single file with the extension PBIX. But there are other related formats.

filename : 'PowerBI-Test.pbix'
filesize : 401951
modified : 2024-02-22T11:29:41-07:00
errors   : 
matches  :
  - ns      : 'pronom'
    id      : 'x-fmt/263'
    format  : 'ZIP Format'
    version : 
    mime    : 'application/zip'
    class   : 'Aggregate'
    basis   : 'byte match at [[0 4] [401867 3] [401929 4]]'
    warning : 'extension mismatch'

Path = PowerBI-Test.pbix
Type = zip
Physical Size = 401951

   Date      Time    Attr         Size   Compressed  Name
------------------- ----- ------------ ------------  ------------------------
2024-02-22 18:29:40 .....            8           10  Version
2024-02-22 18:29:40 .....          488          230  [Content_Types].xml
2024-02-22 18:29:40 .....       397312       397312  DataModel
2024-02-22 18:29:40 .....         2848          882  Report/Layout
2024-02-22 18:29:40 .....          328          161  Settings
2024-02-22 18:29:40 .....          136          120  Connections
2024-02-22 18:29:40 .....        18972         1733  Report/StaticResources/SharedResources/BaseThemes/CY24SU02.json
2024-02-22 18:29:40 .....          358          357  SecurityBindings
------------------- ----- ------------ ------------  ------------------------
2024-02-22 18:29:40             420450       400805  8 files

Just like many modern Microsoft formats it is a ZIP container with a mixture of XML and JSON. There is also a DataModel file along with Settings and Connections. A quick peek at some of the contents shows us:

hexdump -C PowerBI-Test/Version | head
00000000  31 00 2e 00 32 00 38 00                           |1...2.8.|

hexdump -C PowerBI-Test/DataModel | head
00000000  ff fe 53 00 54 00 52 00  45 00 41 00 4d 00 5f 00  |..S.T.R.E.A.M._.|
00000010  53 00 54 00 4f 00 52 00  41 00 47 00 45 00 5f 00  |S.T.O.R.A.G.E._.|
00000020  53 00 49 00 47 00 4e 00  41 00 54 00 55 00 52 00  |S.I.G.N.A.T.U.R.|
00000030  45 00 5f 00 29 00 21 00  40 00 23 00 24 00 25 00  |E._.).!.@.#.$.%.|
00000040  5e 00 26 00 2a 00 28 00  3c 00 42 00 61 00 63 00  |^.&.*.(.<.B.a.c.|
00000050  6b 00 75 00 70 00 4c 00  6f 00 67 00 3e 00 3c 00  |k.u.p.L.o.g.>.<.|
00000060  42 00 61 00 63 00 6b 00  75 00 70 00 52 00 65 00  |B.a.c.k.u.p.R.e.|
00000070  73 00 74 00 6f 00 72 00  65 00 53 00 79 00 6e 00  |s.t.o.r.e.S.y.n.|
00000080  63 00 56 00 65 00 72 00  73 00 69 00 6f 00 6e 00  |c.V.e.r.s.i.o.n.|
00000090  3e 00 31 00 34 00 30 00  3c 00 2f 00 42 00 61 00  |>.1.4.0.<./.B.a.|

hexdump -C PowerBI-Test/\[Content_Types\].xml | head
00000000  ef bb bf 3c 3f 78 6d 6c  20 76 65 72 73 69 6f 6e  |...<?xml version|
00000010  3d 22 31 2e 30 22 20 65  6e 63 6f 64 69 6e 67 3d  |="1.0" encoding=|
00000020  22 75 74 66 2d 38 22 3f  3e 3c 54 79 70 65 73 20  |"utf-8"?><Types |
00000030  78 6d 6c 6e 73 3d 22 68  74 74 70 3a 2f 2f 73 63  |xmlns="http://sc|
00000040  68 65 6d 61 73 2e 6f 70  65 6e 78 6d 6c 66 6f 72  |hemas.openxmlfor|
00000050  6d 61 74 73 2e 6f 72 67  2f 70 61 63 6b 61 67 65  |mats.org/package|
00000060  2f 32 30 30 36 2f 63 6f  6e 74 65 6e 74 2d 74 79  |/2006/content-ty|
00000070  70 65 73 22 3e 3c 44 65  66 61 75 6c 74 20 45 78  |pes"><Default Ex|
00000080  74 65 6e 73 69 6f 6e 3d  22 6a 73 6f 6e 22 20 43  |tension="json" C|
00000090  6f 6e 74 65 6e 74 54 79  70 65 3d 22 22 20 2f 3e  |ontentType="" />|

So it looks like the ZIP structure follows the standard for OpenXML packages as it contains a “[Content_Types].xml” file. So using this XML alone would clash with too many other formats. From what I could find the “DataModel” file is what stores the data is more unique to this format, even though the name is pretty generic. Using a string within the file would probably help be more accurate. The “DataModel” file does have unicode double byte strings we can use. “STREAM_STORAGE_SIGNATURE” seems like a unique enough string to use, but it looks like it may not be unique to PBIX. Looks like the “DataModel” file is a Microsoft “MS-XLDM” file format and is a “Spreadsheet Data Model File Format“.

There is a variation to the DataModel file and I am not sure when the standard is used verses this variation, “This backup was created using XPress9 compression”. Not sure if it is versioning or how the file is saved, but they both seem to function correctly.

hexdump -C DataModel | head
00000000  54 00 68 00 69 00 73 00  20 00 62 00 61 00 63 00  |T.h.i.s. .b.a.c.|
00000010  6b 00 75 00 70 00 20 00  77 00 61 00 73 00 20 00  |k.u.p. .w.a.s. .|
00000020  63 00 72 00 65 00 61 00  74 00 65 00 64 00 20 00  |c.r.e.a.t.e.d. .|
00000030  75 00 73 00 69 00 6e 00  67 00 20 00 58 00 50 00  |u.s.i.n.g. .X.P.|
00000040  72 00 65 00 73 00 73 00  39 00 20 00 63 00 6f 00  |r.e.s.s.9. .c.o.|
00000050  6d 00 70 00 72 00 65 00  73 00 73 00 69 00 6f 00  |m.p.r.e.s.s.i.o.|
00000060  6e 00 2e 00 00 00 00 b0  07 00 76 75 00 00 2a d7  |n.........vu..*.|
00000070  86 4e 00 b0 07 00 ad ab  03 00 2c cb 06 00 00 00  |.N........,.....|
00000080  00 00 f8 6c 86 7f 00 00  00 00 68 01 56 6e 00 00  |...l......h.Vn..|
00000090  20 82 67 49 52 06 00 f6  ab fc fc fe 2d f6 da 8b  | .gIR.......-...|

After a bit of digging it seems like the MS-XLDM format can be found within an XSLX file. I found an example with these datasets. Within an XSLX there can be a found a file “xl/model/item.data” and it has the same structure as DataModel within a PBIX.

hexdump -C Customer Profitability Sample-no-PV/xl/model/item.data | head
00000000  ff fe 53 00 54 00 52 00  45 00 41 00 4d 00 5f 00  |..S.T.R.E.A.M._.|
00000010  53 00 54 00 4f 00 52 00  41 00 47 00 45 00 5f 00  |S.T.O.R.A.G.E._.|
00000020  53 00 49 00 47 00 4e 00  41 00 54 00 55 00 52 00  |S.I.G.N.A.T.U.R.|
00000030  45 00 5f 00 29 00 21 00  40 00 23 00 24 00 25 00  |E._.).!.@.#.$.%.|
00000040  5e 00 26 00 2a 00 28 00  3c 00 42 00 61 00 63 00  |^.&.*.(.<.B.a.c.|
00000050  6b 00 75 00 70 00 4c 00  6f 00 67 00 3e 00 3c 00  |k.u.p.L.o.g.>.<.|
00000060  42 00 61 00 63 00 6b 00  75 00 70 00 52 00 65 00  |B.a.c.k.u.p.R.e.|
00000070  73 00 74 00 6f 00 72 00  65 00 53 00 79 00 6e 00  |s.t.o.r.e.S.y.n.|
00000080  63 00 56 00 65 00 72 00  73 00 69 00 6f 00 6e 00  |c.V.e.r.s.i.o.n.|
00000090  3e 00 31 00 35 00 30 00  3c 00 2f 00 42 00 61 00  |>.1.5.0.<./.B.a.|

Because this file has a different filename and is in a different path, using “DataModel” should keep identification specific to a PBIX file.

The Power BI Report has a template option. This format uses the .PBIT extension and doesn’t contain any data only a template to use with other data. The structure is roughly the same, but doesn’t contain the “DataModel” file, but “DataModelSchema”, which appears to be a JSON file.

hexdump -C DataModelSchema | head
00000000  7b 00 0d 00 0a 00 20 00  20 00 22 00 6e 00 61 00  |{..... . .".n.a.|
00000010  6d 00 65 00 22 00 3a 00  20 00 22 00 38 00 36 00  |m.e.".:. .".8.6.|
00000020  65 00 34 00 32 00 62 00  33 00 30 00 2d 00 30 00  |e.4.2.b.3.0.-.0.|
00000030  34 00 34 00 33 00 2d 00  34 00 36 00 30 00 63 00  |4.4.3.-.4.6.0.c.|
00000040  2d 00 61 00 36 00 66 00  36 00 2d 00 36 00 66 00  |-.a.6.f.6.-.6.f.|
00000050  34 00 35 00 35 00 66 00  64 00 64 00 31 00 61 00  |4.5.5.f.d.d.1.a.|
00000060  35 00 36 00 22 00 2c 00  0d 00 0a 00 20 00 20 00  |5.6.".,..... . .|
00000070  22 00 63 00 6f 00 6d 00  70 00 61 00 74 00 69 00  |".c.o.m.p.a.t.i.|
00000080  62 00 69 00 6c 00 69 00  74 00 79 00 4c 00 65 00  |b.i.l.i.t.y.L.e.|
00000090  76 00 65 00 6c 00 22 00  3a 00 20 00 31 00 35 00  |v.e.l.".:. .1.5.|

The DataModelSchema JSON has some plain text strings which could be used for identification. Later in the file there is a string, “defaultPowerBIDataSourceVersion“.

000001c0  20 00 20 00 20 00 7d 00  2c 00 0d 00 0a 00 20 00  | . . .}.,..... .|
000001d0  20 00 20 00 20 00 22 00  64 00 65 00 66 00 61 00  | . . .".d.e.f.a.|
000001e0  75 00 6c 00 74 00 50 00  6f 00 77 00 65 00 72 00  |u.l.t.P.o.w.e.r.|
000001f0  42 00 49 00 44 00 61 00  74 00 61 00 53 00 6f 00  |B.I.D.a.t.a.S.o.|
00000200  75 00 72 00 63 00 65 00  56 00 65 00 72 00 73 00  |u.r.c.e.V.e.r.s.|
00000210  69 00 6f 00 6e 00 22 00  3a 00 20 00 22 00 70 00  |i.o.n.".:. .".p.|
00000220  6f 00 77 00 65 00 72 00  42 00 49 00 5f 00 56 00  |o.w.e.r.B.I._.V.|
00000230  33 00 22 00 2c 00 0d 00  0a 00 20 00 20 00 20 00  |3.".,..... . . .|

Seems like the best identification of the template format.

As usual you can find my signature proposal on my GitHub along with a couple “safe” samples.

Picture It!

Most everyone has heard of Microsoft Office, the suite of applications used by millions everyday. Less people know about Microsoft Works, which was a lower cost alternative, but was quite popular as a home office suite of applications. One tool which often came with the Works suite was a digital image tool called Picture It!

Picture It! was a photo editing tool first released by Microsoft in 1996 geared to making photo editing easy and affordable.

Picture It! used a wizard type interface which walked you through acquiring an image and adding to it. One of the key features of the software was the ability to “stack” objects like layers. Because of this feature a new file format was used to save this information to disk. Meet the Microsoft Image (Picture) Extension format, commonly known as the MIX file format. It is very similar to the FlashPix image format, which was supposed to be an image file format to solve many delivery issues, but didn’t seem to gain hold despite being created by Kodak, HP, and others. In fact many of the MIX files I found on Microsoft disks are actually FlashPix files.

The MIX extension was also used by another Microsoft program, PhotoDraw, which causes confusion as they were similar, but PhotoDraw has some added features which may not be compatible with Picture It!. Both formats are based on the Microsoft Compound Object (OLE) container, and have a similar structure. Let’s take a look at a MIX file from Picture It! version 1.

7z l PictureIt1-s02.mix                 

--
Path = PictureIt1-s02.mix
Type = Compound
Physical Size = 48128
Extension = compound
Cluster Size = 512
Sector Size = 64

   Date      Time    Attr         Size   Compressed  Name
------------------- ----- ------------ ------------  ------------------------
                    .....          328          384  [5]Data Object 000001
                    .....          396          448  [5]Transform 000004
                    .....          872          896  [5]Operation 000001
                    .....          320          320  [1]CompObj
                    .....          292          320  [5]Global Info
                    .....          872          896  [5]Operation 000002
                    .....          144          192  [5]Operation 000003
                    .....          684          704  [5]Transform 000008
                    .....         1028         1088  [5]Transform 000009
                    .....          328          384  [5]Data Object 000009
                    .....          324          384  [5]Data Object 000005
2023-12-27 11:04:39 D....                            Data Object Store 000001
                    .....          328          384  [5]Data Object 000010
                    .....        20932        20992  [5]SummaryInformation
                    .....          200          256  [5]Microsoft Embedding Info
2023-12-27 11:04:39 D....                            Data Object Store 000001/Resolution 0001
                    .....         1400         1408  Data Object Store 000001/[5]Image Contents
                    .....          230          256  Data Object Store 000001/[1]CompObj
2023-12-27 11:04:39 D....                            Data Object Store 000001/Resolution 0000
                    .....           28           64  Data Object Store 000001/Resolution 0000/Subimage 0000 Data
                    .....           80          128  Data Object Store 000001/Resolution 0000/Subimage 0000 Header
2023-12-27 11:04:39 D....                            Data Object Store 000001/Resolution 0003
2023-12-27 11:04:39 D....                            Data Object Store 000001/Resolution 0002
                    .....           28           64  Data Object Store 000001/Resolution 0002/Subimage 0000 Data
                    .....          208          256  Data Object Store 000001/Resolution 0002/Subimage 0000 Header
2023-12-27 11:04:39 D....                            Data Object Store 000001/Resolution 0005
2023-12-27 11:04:39 D....                            Data Object Store 000001/Resolution 0004
                    .....           28           64  Data Object Store 000001/Resolution 0004/Subimage 0000 Data
                    .....         1792         1792  Data Object Store 000001/Resolution 0004/Subimage 0000 Header
                    .....          124          128  Data Object Store 000001/[5]SummaryInformation
                    .....           28           64  Data Object Store 000001/Resolution 0005/Subimage 0000 Data
                    .....         6976         7168  Data Object Store 000001/Resolution 0005/Subimage 0000 Header
                    .....           28           64  Data Object Store 000001/Resolution 0003/Subimage 0000 Data
                    .....          544          576  Data Object Store 000001/Resolution 0003/Subimage 0000 Header
                    .....           28           64  Data Object Store 000001/Resolution 0001/Subimage 0000 Data
                    .....          128          128  Data Object Store 000001/Resolution 0001/Subimage 0000 Header
------------------- ----- ------------ ------------  ------------------------
2023-12-27 11:04:39              38698        39872  29 files, 7 folders

This is a simple MIX file with one line of text, but contains a lot of content inside the OLE container. If I try and use the PRONOM registry to identify the file, I get:

sf PictureIt1-s02.mix 
---
siegfried   : 1.11.0
scandate    : 2023-12-27T11:06:32-07:00
signature   : default.sig
created     : 2023-12-17T15:54:41+01:00
identifiers : 
  - name    : 'pronom'
    details : 'DROID_SignatureFile_V116.xml; container-signature-20231127.xml'
---
filename : 'PictureIt1-s02.mix'
filesize : 48128
modified : 2023-12-27T11:04:40-07:00
errors   : 
matches  :
  - ns      : 'pronom'
    id      : 'fmt/111'
    format  : 'OLE2 Compound Document Format'
    version : 
    mime    : 
    class   : 'Text (Structured)'
    basis   : 'byte match at 0, 30'
    warning : 

Hmm, we know it is an OLE compound document, but it should identify as a Picture It! file as PRONOM has defined a PUID for the format. fmt/936 has been defined as “Microsoft Picture It! Image File 1”. So I am not sure why this file from version 1 is not identifying correctly. Let’s take a look. The PRONOM container signature for fmt/936 is looking for this:

    <ContainerSignature Id="17015" ContainerType="OLE2">
      <Description>Microsoft Picture It! Image File</Description>
      <Files>
        <File>
          <Path>CompObj</Path>
          <BinarySignatures>
            <InternalSignatureCollection>
              <InternalSignature ID="17015">
                <ByteSequence Reference="BOFoffset">
                  <SubSequence Position="1" SubSeqMinOffset="32"
                               SubSeqMaxOffset="32">
                    <Sequence>'Microsoft Picture It! version 1 Picture'</Sequence>
                  </SubSequence>
                </ByteSequence>
              </InternalSignature>
            </InternalSignatureCollection>
          </BinarySignatures>
        </File>
      </Files>
    </ContainerSignature>

The container signature is looking into the OLE container for the “CompObj” file (which seems to be required), then looks for the string “Microsoft Picture It! version 1 Picture” starting at the 32nd byte. That is pretty specific. The sample file I am using as an example has the following string of bytes.

hexdump -C PictureIt1-s02/\[1\]CompObj 
00000000  01 00 fe ff 03 0a 00 00  ff ff ff ff 00 68 61 56  |.............haV|
00000010  54 c1 ce 11 85 53 00 aa  00 a1 f9 5b 1e 00 00 00  |T....S.....[....|
00000020  4d 69 63 72 6f 73 6f 66  74 20 50 69 63 74 75 72  |Microsoft Pictur|
00000030  65 20 49 74 21 20 50 69  63 74 75 72 65 00 27 00  |e It! Picture.'.|
00000040  00 00 7b 35 36 36 31 36  38 30 30 2d 43 31 35 34  |..{56616800-C154|
00000050  2d 31 31 43 45 2d 38 35  35 33 2d 30 30 41 41 30  |-11CE-8553-00AA0|
00000060  30 41 31 46 39 35 42 7d  00 13 00 00 00 50 69 63  |0A1F95B}.....Pic|
00000070  74 75 72 65 49 74 21 2e  50 69 63 74 75 72 65 00  |tureIt!.Picture.|

Ok, so this sample has a similar string but is missing the “version 1” text. It seems the samples used to created the PRONOM signature was working off samples which included the version 1 in the header of CompObj. Maybe when Microsoft learned they would be making a version 2, they decided a version number should be included going forward. Let’s take a look a file from version 2 to compare:

hexdump -C PictureIt2-s01/\[1\]CompObj 
00000000  01 00 fe ff 03 0a 00 00  ff ff ff ff 50 28 72 2d  |............P(r-|
00000010  4b 8c d0 11 a9 6f 00 a0  c9 05 41 0d 28 00 00 00  |K....o....A.(...|
00000020  4d 69 63 72 6f 73 6f 66  74 20 50 69 63 74 75 72  |Microsoft Pictur|
00000030  65 20 49 74 21 20 76 65  72 73 69 6f 6e 20 32 20  |e It! version 2 |
00000040  50 69 63 74 75 72 65 00  27 00 00 00 7b 32 44 37  |Picture.'...{2D7|
00000050  32 32 38 35 30 2d 38 43  34 42 2d 31 31 44 30 2d  |22850-8C4B-11D0-|
00000060  41 39 36 46 2d 30 30 41  30 43 39 30 35 34 31 30  |A96F-00A0C905410|
00000070  44 7d 00 f4 39 b2 71 50  00 00 00 4d 00 69 00 63  |D}..9.qP...M.i.c|

Ok, so it looks like they did update the version string for version 2. This file also does not identify correctly. A quick look at the wikipedia page for Microsoft Picture It! tells us they continued to release the software until version 10. Is there a different string for each version?

Diving into this and gathering many samples has brought a lot of variants to surface. Let’s see if we can list all the CompObj header variants.

Version 1 samples:
Picture It! Picture'{56616800-C154-11CE-8553-00AA00A1F95B}
Microsoft Picture It! Picture'{56616800-C154-11CE-8553-00AA00A1F95B}
Microsoft Picture It! version 1 Picture'{56616800-C154-11CE-8553-00AA00A1F95B}
Picture It! Collage'{56616800-C154-11CE-8553-00AA00A1F95B}

Version 2 samples:
Microsoft Picture It! version 2 Picture'{2D722850-8C4B-11D0-A96F-00A0C905410D}

Version 3 samples:
Microsoft Picture It! version 3 Picture'{18B8D020-B4FD-11D0-A97E-00A0C905410D}

Version 4 samples:
Microsoft Picture It! version 4 Picture'{18B8D020-B4FD-11D0-A97E-00A0C905410D}

PhotoDraw version 1 samples:
Microsoft PhotoDraw version 1 Picture'{18B8D020-B4FD-11D0-A97E-00A0C905410D}

PhotoDraw version 2 samples:
Microsoft PhotoDraw version 2 Picture'{18B8D021-B4FD-11D0-A97E-00A0C905410D}

FlashPix samples:
FlashPix Object({56616000-C154-11CE-8553-00AA00A1F95B}
FlashPix Object({56616800-C154-11CE-8553-00AA00A1F95B}
Picture It! FlashPix'{56616700-C154-11CE-8553-00AA00A1F95B}
LPI FlashPix'{56616700-c154-11ce-8553-00aa00a1f95b}
FlashPix_Object'{56616700-C154-11CE-8553-00AA00A1F95B}
'{56616700-C154-11CE-8553-00AA00A1F95B}
Picture It!'{56616700-c154-11ce-8553-00aa00a1f95b}
Flashpix Toolkit Application'{56616700-c154-11ce-0000-000000000000}

Ok, there is a lot to discuss here. First of all, it seems MIX was only used in Picture It! until version 5 (2001), then the Picture It! software used a new format, PNG Plus to store the layered stacks. More on that in a future post! Although some later versions seems to be able to open the older MIX format. Version 4 of the MIX format seems to be the last as the 2001 software had only version 4 files on it. Probably safe to say only the 4 versions are needed for identification.

You may notice the additional unique identifier I included in each format. This is called a Class ID for the OLE format, which A LOT of formats use. Each “format” has a unique ID associated with it to help distinguish it from other formats. This Unique ID could possibly be a better solution for identification. It does cross over with the PhotoDraw format, but the FlashPix format seems to have a unique ID. With all the variations in the version 1 strings, the ID remains the same. For version 3 and 4 the ID is the same, which could mean they are interchangeable. It is also the same as PhotoDraw version 1. Not to complicate things.

So it seems in order to get proper identification of these similar formats we need to:

  • Clean up version 1 identification for fmt/936
  • Add a signature for 2, 3, and 4
  • Add a version 2 signature for the PhotoDraw format
  • Add some additional signature variations for the FlashPix format.

The Class ID’s could be used to distinguish different versions and formats, but many of the ID’s are identical, this could mean they are the same format. But for now we can just add the additional variation strings and it should identify everything for now. The FlashPix format needs more research as there is so many different variations and it’s so close to the MIX format. Take a look at my GitHub submission, maybe you have some additional variations to add?

Multiplan

This is a follow up post to the post “EARLY MICROSOFT EXCEL” earlier this year.

I have to admit, often when I am researching file formats I can get distracted by a shinier format I come across. I often go down rabbit holes and forget the reason I started down the path I am on. I try and focus on the current needs in my life as a Digital Preservation Manager, but can get easily sidetracked. I always look forward to November every year so I can celebrate World Digital Preservation Day which sometimes comes along with a PRONOM research week. This gives me a chance to look at formats that may need attention which are not normally on my radar.

This week I a taking a look again at Multiplan. There is a PRONOM PUID for version 4, but does not have a description nor does it have a binary signature. It is was also lacking a File Format Wiki entry. So I decided to dive in. I had already bumped into the format while doing some research on early Microsoft Excel formats. This includes the SYLK format which needed a little update.

Microsoft Multiplan was the parent of Microsoft Excel. Multiplan was built for many different types of computers in the 1980’s, but was never ported to Windows. So to use Multiplan you have to be comfortable with using DOS. If you want to take Multiplan for a spin, head over to PCjs Machines and load up one of the many emulated systems they have.

In the end, Multiplan had four versions, but the last one, version 4.2, had some big changes, especially to the file format. More on that in a minute.

Mutiplan Version 1 – DOS

hexdump -C MP1.MOD  | head
00000000  08 e7 00 00 58 09 01 00  08 00 01 00 00 00 0a 00  |....X...........|
00000010  40 00 00 00 2e f5 0a 80  27 07 94 00 12 00 01 00  |@.......'.......|
00000020  0a 00 01 00 0c 0a 08 00  27 00 0d 80 04 00 01 00  |........'.......|
00000030  54 00 00 00 27 00 10 00  54 52 41 4e 53 46 48 f5  |T...'...TRANSFH.|
00000040  00 80 84 0a 68 61 52 f5  58 f5 5a f5 4e f5 0c 0a  |....haR.X.Z.N...|
00000050  12 00 01 00 72 f5 72 f5  0a 80 4b 0b 0f 00 12 00  |....r.r...K.....|
00000060  0c 0a 01 00 0c 00 01 00  08 00 20 4e 40 00 09 00  |.......... N@...|
00000070  8a f5 0a 80 4a 07 30 00  48 00 01 00 20 4e 00 00  |....J.0.H... N..|
00000080  28 0c 18 00 04 00 0d 80  03 00 28 0c 04 00 00 00  |(.........(.....|
00000090  26 00 00 00 54 52 d0 01  00 00 a4 f5 0a 00 62 0b  |&...TR........b.|

Mutiplan Version 1 – Macintosh

hexdump -C Multiplan1  | head
00000000  11 ab 00 00 13 e8 00 00  00 00 00 00 00 02 02 8c  |................|
00000010  00 18 00 0e 02 a4 02 b2  00 0e 02 fe 00 03 00 0e  |................|
00000020  00 bd 01 e3 2f 0f 00 08  15 5e 19 d1 03 5e 19 dd  |..../....^...^..|
00000030  61 60 60 5e 16 90 00 67  60 60 60 8f 5f 03 e8 7a  |a``^...g```._..z|
00000040  30 61 60 60 13 5f 03 e8  7b 90 00 67 60 60 60 8f  |0a``._..{..g```.|
00000050  16 85 67 60 60 60 8f 16  6d 85 61 60 60 13 5e 10  |..g```..m.a``.^.|
00000060  7b 90 00 67 60 60 60 8f  13 7a 31 14 6a d7 16 6e  |{..g```..z1.j..n|
00000070  85 14 77 60 16 6f 85 67  60 60 60 90 00 67 60 60  |..w`.o.g```..g``|
00000080  60 90 00 67 60 60 60 8f  13 7a 31 14 6a d7 16 70  |`..g```..z1.j..p|
00000090  85 14 77 60 16 71 85 67  60 60 60 90 00 67 60 60  |..w`.q.g```..g``|

Mutiplan Version 2 – DOS

hexdump -C MP2.MOD | head
00000000  0c ec 00 00 08 ab 08 00  1f 00 1a 00 03 00 27 03  |..............'.|
00000010  4b 05 00 00 00 00 00 00  00 00 00 1d c8 14 03 00  |K...............|
00000020  00 00 2f 00 9a 2e b3 fc  46 02 34 04 f3 16 00 00  |../.....F.4.....|
00000030  00 00 00 00 08 00 10 22  00 00 0d 06 84 1d 08 1d  |......."........|
00000040  ff 03 83 0a c8 18 48 1a  02 19 00 00 00 00 15 1b  |......H.........|
00000050  98 15 85 15 03 00 2a 00  00 37 46 32 1c 00 18 00  |......*..7F2....|
00000060  00 00 01 00 01 00 a9 03  0f 80 e8 14 00 00 01 00  |................|
00000070  6a 1c 00 00 01 00 0d 00  0f 80 0a 15 00 00 77 20  |j.............w |
00000080  00 00 01 00 6e 61 6c 20  00 00 2a 00 00 00 04 00  |....nal ..*.....|
00000090  00 00 0d 00 14 19 00 00  d4 06 0e 80 24 15 00 00  |............$...|

The DOS files for Version 2 begin with 0CEC0000 08AB0800, but a file for the Xenix system starts with 0AEC0000 08AB0800. So it appears the first byte may be different depending on the system.

hexdump -C MP3.MOD | head                         
00000000  0c ed 00 00 08 ab 08 00  1f 00 1a 00 00 00 00 00  |................|
00000010  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000110  00 00 02 00 01 00 00 00  00 00 ff 0f ff 00 00 00  |................|
00000120  00 00 05 00 06 00 46 00  36 00 42 00 00 00 00 00  |......F.6.B.....|
00000130  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000140  00 00 00 00 00 00 00 00  00 00 00 00 00 01 00 00  |................|
00000150  00 fe 0f 00 fe 00 00 00  00 00 00 00 00 00 00 00  |................|
00000160  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|

The DOS files for Version 3 begin with a similar hex pattern, 0CED0000 08AB0800. This would make sense as the documentation for Multiplan 4.2 states it supports opening of Version 2 & 3, but not Version 1.

There was also a companion product that went along with Multiplan, it was called Microsoft Chart. Here is a file from version 3:

hexdump -C EXAMPLE1.MC | head
00000000  90 01 00 00 08 ab 00 00  00 00 00 00 00 00 04 00  |................|
00000010  80 00 05 00 04 00 43 10  00 00 00 00 00 00 00 00  |......C.........|
00000020  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000080  e8 ff 04 00 00 22 24 36  a4 1f 00 00 11 00 24 00  |....."$6......$.|
00000090  03 00 64 00 00 00 cc 0c  cc 0c cc 0c cc 0c 00 00  |..d.............|
000000a0  00 00 00 00 ff 7f 00 00  01 f0 00 00 00 5f 00 00  |............._..|
000000b0  00 00 a2 ff a0 ff 00 00  01 f0 01 00 64 00 00 00  |............d...|
000000c0  00 00 01 f0 00 00 01 70  00 00 8e ff 8c ff 00 ff  |.......p........|
000000d0  0d 00 00 00 20 14 01 00  00 00 00 00 00 00 00 00  |.... ...........|

The Chart file format has a similar byte pattern with the 08AB pattern and looks similar to the BIFF format. We will have to make sure it doesn’t conflict with any signatures so it can be identified separately.

Version 4 of Multiplan was the first to use the BIFF (Binary Interchange File Format). Technically Version BIFF2, not much is know about BIFF1 or if it ever existed. BIFF2 is the exact same format as Excel 2.0 used, so there will be some problems if we want to identify them separately. They currently identify as fmt/55.

hexdump -C MP4.MOD | head
00000000  09 00 04 00 40 01 10 00  42 00 02 00 b5 01 66 00  |....@...B.....f.|
00000010  1b 00 00 00 00 00 00 00  00 00 ff ff 0f 01 00 01  |................|
00000020  00 01 00 01 00 00 00 00  00 00 00 00 00 0d 00 02  |................|
00000030  00 01 00 0e 00 02 00 01  00 0f 00 02 00 00 00 11  |................|
00000040  00 02 00 00 00 2a 00 02  00 00 00 6b 00 13 00 01  |.....*.....k....|
00000050  00 00 00 00 00 fe 0f 00  fe 40 02 e0 3d d0 2f 00  |.........@..=./.|
00000060  01 00 26 00 08 00 00 00  00 00 00 00 e0 3f 27 00  |..&..........?'.|
00000070  08 00 00 00 00 00 00 00  e0 3f 28 00 08 00 00 00  |.........?(.....|
00000080  00 00 00 00 f0 3f 29 00  08 00 00 00 00 00 00 00  |.....?).........|
00000090  f0 3f 70 00 0b 00 00 00  2e 00 02 04 f0 0a 00 f0  |.?p.............|

hexdump -C EXCEL2.XLS | head
00000000  09 00 04 00 02 00 10 00  0b 00 10 00 71 02 00 00  |............q...|
00000010  01 00 29 00 06 03 00 00  dc 0d 00 00 0c 00 02 00  |..).............|
00000020  64 00 0d 00 02 00 01 00  0e 00 02 00 01 00 0f 00  |d...............|
00000030  02 00 01 00 10 00 08 00  fc a9 f1 d2 4d 62 50 3f  |............MbP?|
00000040  11 00 02 00 00 00 22 00  02 00 00 00 40 00 02 00  |......".....@...|
00000050  00 00 2a 00 02 00 00 00  2b 00 02 00 00 00 25 00  |..*.....+.....%.|
00000060  02 00 2c 01 31 00 09 00  c8 00 00 00 04 48 65 6c  |..,.1........Hel|
00000070  76 32 00 0e 00 00 00 00  00 00 00 90 01 00 00 00  |v2..............|
00000080  00 00 8d 31 00 09 00 c8  00 01 00 04 48 65 6c 76  |...1........Helv|
00000090  32 00 0e 00 00 00 00 00  00 00 bc 02 00 00 00 00  |2...............|

You can see in the hex values above a difference of two bytes in the header. The reason the Multiplan file identifies as an Excel 2 file is the PRONOM signature ignores those two bytes and allows them to be anything. Some specifications say these aren’t used, but clearly there is a use for them. We could probably use the same signature for Multiplan, but include the two bytes, then set the priority to the Multiplan signature.

Multiplan 4.2 is very different.

hexdump -C MP42.MOD | head
00000000  0c ef 4d 50 a4 01 00 00  00 00 00 00 00 00 00 00  |..MP............|
00000010  00 00 00 00 00 00 80 02  00 00 00 00 00 00 00 2e  |................|
00000020  ff 0f ff 00 01 00 d0 02  d0 02 a0 05 a0 05 d0 2f  |.............../|
00000030  e0 3d 40 02 09 00 03 00  02 04 0a 00 00 00 fe 0f  |.=@.............|
00000040  00 fe 00 00 01 00 01 00  00 00 00 00 00 00 00 00  |................|
00000050  00 00 00 00 00 00 00 00  01 00 00 00 06 01 15 50  |...............P|
00000060  05 00 00 00 00 00 00 00  06 00 13 00 07 00 07 00  |................|
00000070  00 00 00 00 00 00 08 00  00 00 00 00 00 00 00 00  |................|
00000080  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|

The hex values for the first 4 bytes have a similar pattern. 0CEF, Which seems to be in sequence where Version 3 left off. Microsoft calls this new format, New or Normal Binary File Format. They claim it is “the fastest loading and fastest saving file format ever“! Exciting as the new format probably was, it didn’t last long. Multiplan was phased out so Excel could shine.

When I was younger I didn’t use DOS very often because the computer my father brought home in the mid 1980’s was a Macintosh. I use DOS more now in my research then I did when I was younger. Using the DOS interface is not easy. There are a lot of key commands you need to know intuitively just to navigate, but it is fascinating to see how far software has come. Early Excel, Multiplan, and Chart were all intertwined, but hopefully combing through all of these samples can bring some clarity. Take a look at the draft signature I made and all the samples that go with it on my GitHub page.

Early Microsoft Excel

The first version of Microsoft Excel was released on Macintosh in 1985. Before that there was MultiPlan.

MultiPlan version 4 and Excel version 2 used the well known and documented BIFF format. Before BIFF2 the formats are a bit of a mystery. AFAIK, Microsoft never released any documentation on the file format used for Excel version 1 and MultiPlan 1 -3, they emphasized using the SYLK format for interchange. To make matters worse, there were upwards of 100 different versions of the early MultiPlan, ported for dozens of different systems. Some of them are discussed on the TRS-80 website.

Or you can take MultiPlan 1.06 for a spin over at PCjs!

Needless to say documenting and finding a pattern which could be used to identify the early versions of MultiPlan and Excel 1 are difficult. These versions are missing from the PRONOM registry, but hopefully with enough samples, some patterns can be found to confidently identify formats from the early days of spreadsheets!

Marco Pontello’s TrID identifier software has signatures for the early Multiplan and Excel formats. His software scans for patterns in samples and finds commonalities between them. So the more samples he can scan the more accurate the identification can be.

Currently the signatures are as follows.

Microsoft Excel for Mac Spreadsheet (v1.x)
		<Pattern>
			<Bytes>532700</Bytes>
			<ASCII> S '</ASCII>
			<Pos>0</Pos>
		</Pattern>
		<Pattern>
			<Bytes>AB27000000000000000203</Bytes>
			<ASCII> . '</ASCII>
			<Pos>4</Pos>
		</Pattern>
Multiplan for Mac spreadsheet (v1.x)
		<Pattern>
			<Bytes>11AB000013E8000000000000</Bytes>
			<ASCII> . . . . . . . . . . . .</ASCII>
			<Pos>0</Pos>
		</Pattern>
Multiplan spreadsheet (v1.x)
		<Pattern>
			<Bytes>0CE9000008AB08001F0016000200</Bytes>
			<Pos>0</Pos>
		</Pattern>
Multiplan spreadsheet (v1.0x)
		<Pattern>
			<Bytes>08E700</Bytes>
			<Pos>0</Pos>
		</Pattern>
		<Pattern>
			<Bytes>0100</Bytes>
			<Pos>6</Pos>
		</Pattern>
		<Pattern>
			<Bytes>000000</Bytes>
			<Pos>11</Pos>
		</Pattern>
Multiplan spreadsheet (v2.x)
		<Pattern>
			<Bytes>0CEC000008AB08001F001A000300</Bytes>
			<Pos>0</Pos>
		</Pattern>
Multiplan for Xenix spreadsheet (v2.x)
		<Pattern>
			<Bytes>0AEC000008AB08001F001A000300</Bytes>
			<Pos>0</Pos>
		</Pattern>
Multiplan spreadsheet (v3.x)
		<Pattern>
			<Bytes>0CED000008AB08001F001A000000</Bytes>
			<Pos>0</Pos>
		</Pattern>

There seems to be some patterns between versions, but then also some major differences. Without a specification or an understanding of the system the samples were created on, it is hard to identify these formats with certainty. There could be hex values which are the same for the samples we have but different for others, headers can often have values indicating dates or length of the file, so finding variations in files is key to a good signature.

Keep an eye on my GitHub PRONOM Research folder as I add more samples and prepare a signature for PRONOM.