I think when most of us have some data to sort or make sense of, we tend to gravitate toward a spreadsheet. Using Excel or LibreOffice, or if you really like to party, OpenRefine. There are plenty of meme’s out there representing the frustration people have with bugs, features and limitations of Excel specifically.
There are more tools out there for making sense of data, one some people have access to is Microsoft’s more advanced PowerBI tool. Marketed as a Data Visualization tool it is accessible to many with a Office 365 subscription. It offers expanded features than excel and isn’t as limited in row maximums.
PowerBi was recently the topic of a Code4Lib editorial issue. The writer of an article for their journal posted two PowerBI datasets which a reader later noticed had private data. After some miscommunications and misunderstandings an open letter was drafted and received some support. Code4Lib did release a statement and lessons were learned.
One statement from the Code4Lib staff caught my eye. “The released files were in a proprietary file format, Microsoft Power BI, with which none of the editors have experience.”
We all use tools for our jobs we are most familiar or available to us. No one can be an expert in all file formats. Some us try, but things change so fast it is impossible. But, we can do more in documenting and making formats identifiable through the tools we use for digital preservation. The File Format Wiki and PRONOM have had no mention of Power BI, so let’s change that.
Microsoft Power BI was released in 2011 and has been part of the Microsoft Power Platform. Power BI can gather data from many sources. The software can be accessed in the Office 365 cloud, but also using a Desktop application. In the desktop application, all the data sources and connections are stored in a single file with the extension PBIX. But there are other related formats.
filename : 'PowerBI-Test.pbix' filesize : 401951 modified : 2024-02-22T11:29:41-07:00 errors : matches : - ns : 'pronom' id : 'x-fmt/263' format : 'ZIP Format' version : mime : 'application/zip' class : 'Aggregate' basis : 'byte match at [[0 4] [401867 3] [401929 4]]' warning : 'extension mismatch' Path = PowerBI-Test.pbix Type = zip Physical Size = 401951 Date Time Attr Size Compressed Name ------------------- ----- ------------ ------------ ------------------------ 2024-02-22 18:29:40 ..... 8 10 Version 2024-02-22 18:29:40 ..... 488 230 [Content_Types].xml 2024-02-22 18:29:40 ..... 397312 397312 DataModel 2024-02-22 18:29:40 ..... 2848 882 Report/Layout 2024-02-22 18:29:40 ..... 328 161 Settings 2024-02-22 18:29:40 ..... 136 120 Connections 2024-02-22 18:29:40 ..... 18972 1733 Report/StaticResources/SharedResources/BaseThemes/CY24SU02.json 2024-02-22 18:29:40 ..... 358 357 SecurityBindings ------------------- ----- ------------ ------------ ------------------------ 2024-02-22 18:29:40 420450 400805 8 files
Just like many modern Microsoft formats it is a ZIP container with a mixture of XML and JSON. There is also a DataModel file along with Settings and Connections. A quick peek at some of the contents shows us:
hexdump -C PowerBI-Test/Version | head 00000000 31 00 2e 00 32 00 38 00 |1...2.8.| hexdump -C PowerBI-Test/DataModel | head 00000000 ff fe 53 00 54 00 52 00 45 00 41 00 4d 00 5f 00 |..S.T.R.E.A.M._.| 00000010 53 00 54 00 4f 00 52 00 41 00 47 00 45 00 5f 00 |S.T.O.R.A.G.E._.| 00000020 53 00 49 00 47 00 4e 00 41 00 54 00 55 00 52 00 |S.I.G.N.A.T.U.R.| 00000030 45 00 5f 00 29 00 21 00 40 00 23 00 24 00 25 00 |E._.).!.@.#.$.%.| 00000040 5e 00 26 00 2a 00 28 00 3c 00 42 00 61 00 63 00 |^.&.*.(.<.B.a.c.| 00000050 6b 00 75 00 70 00 4c 00 6f 00 67 00 3e 00 3c 00 |k.u.p.L.o.g.>.<.| 00000060 42 00 61 00 63 00 6b 00 75 00 70 00 52 00 65 00 |B.a.c.k.u.p.R.e.| 00000070 73 00 74 00 6f 00 72 00 65 00 53 00 79 00 6e 00 |s.t.o.r.e.S.y.n.| 00000080 63 00 56 00 65 00 72 00 73 00 69 00 6f 00 6e 00 |c.V.e.r.s.i.o.n.| 00000090 3e 00 31 00 34 00 30 00 3c 00 2f 00 42 00 61 00 |>.1.4.0.<./.B.a.| hexdump -C PowerBI-Test/\[Content_Types\].xml | head 00000000 ef bb bf 3c 3f 78 6d 6c 20 76 65 72 73 69 6f 6e |...<?xml version| 00000010 3d 22 31 2e 30 22 20 65 6e 63 6f 64 69 6e 67 3d |="1.0" encoding=| 00000020 22 75 74 66 2d 38 22 3f 3e 3c 54 79 70 65 73 20 |"utf-8"?><Types | 00000030 78 6d 6c 6e 73 3d 22 68 74 74 70 3a 2f 2f 73 63 |xmlns="http://sc| 00000040 68 65 6d 61 73 2e 6f 70 65 6e 78 6d 6c 66 6f 72 |hemas.openxmlfor| 00000050 6d 61 74 73 2e 6f 72 67 2f 70 61 63 6b 61 67 65 |mats.org/package| 00000060 2f 32 30 30 36 2f 63 6f 6e 74 65 6e 74 2d 74 79 |/2006/content-ty| 00000070 70 65 73 22 3e 3c 44 65 66 61 75 6c 74 20 45 78 |pes"><Default Ex| 00000080 74 65 6e 73 69 6f 6e 3d 22 6a 73 6f 6e 22 20 43 |tension="json" C| 00000090 6f 6e 74 65 6e 74 54 79 70 65 3d 22 22 20 2f 3e |ontentType="" />|
So it looks like the ZIP structure follows the standard for OpenXML packages as it contains a “[Content_Types].xml” file. So using this XML alone would clash with too many other formats. From what I could find the “DataModel” file is what stores the data is more unique to this format, even though the name is pretty generic. Using a string within the file would probably help be more accurate. The “DataModel” file does have unicode double byte strings we can use. “STREAM_STORAGE_SIGNATURE” seems like a unique enough string to use, but it looks like it may not be unique to PBIX. Looks like the “DataModel” file is a Microsoft “MS-XLDM” file format and is a “Spreadsheet Data Model File Format“.
There is a variation to the DataModel file and I am not sure when the standard is used verses this variation, “This backup was created using XPress9 compression”. Not sure if it is versioning or how the file is saved, but they both seem to function correctly.
hexdump -C DataModel | head 00000000 54 00 68 00 69 00 73 00 20 00 62 00 61 00 63 00 |T.h.i.s. .b.a.c.| 00000010 6b 00 75 00 70 00 20 00 77 00 61 00 73 00 20 00 |k.u.p. .w.a.s. .| 00000020 63 00 72 00 65 00 61 00 74 00 65 00 64 00 20 00 |c.r.e.a.t.e.d. .| 00000030 75 00 73 00 69 00 6e 00 67 00 20 00 58 00 50 00 |u.s.i.n.g. .X.P.| 00000040 72 00 65 00 73 00 73 00 39 00 20 00 63 00 6f 00 |r.e.s.s.9. .c.o.| 00000050 6d 00 70 00 72 00 65 00 73 00 73 00 69 00 6f 00 |m.p.r.e.s.s.i.o.| 00000060 6e 00 2e 00 00 00 00 b0 07 00 76 75 00 00 2a d7 |n.........vu..*.| 00000070 86 4e 00 b0 07 00 ad ab 03 00 2c cb 06 00 00 00 |.N........,.....| 00000080 00 00 f8 6c 86 7f 00 00 00 00 68 01 56 6e 00 00 |...l......h.Vn..| 00000090 20 82 67 49 52 06 00 f6 ab fc fc fe 2d f6 da 8b | .gIR.......-...|
After a bit of digging it seems like the MS-XLDM format can be found within an XSLX file. I found an example with these datasets. Within an XSLX there can be a found a file “xl/model/item.data” and it has the same structure as DataModel within a PBIX.
hexdump -C Customer Profitability Sample-no-PV/xl/model/item.data | head 00000000 ff fe 53 00 54 00 52 00 45 00 41 00 4d 00 5f 00 |..S.T.R.E.A.M._.| 00000010 53 00 54 00 4f 00 52 00 41 00 47 00 45 00 5f 00 |S.T.O.R.A.G.E._.| 00000020 53 00 49 00 47 00 4e 00 41 00 54 00 55 00 52 00 |S.I.G.N.A.T.U.R.| 00000030 45 00 5f 00 29 00 21 00 40 00 23 00 24 00 25 00 |E._.).!.@.#.$.%.| 00000040 5e 00 26 00 2a 00 28 00 3c 00 42 00 61 00 63 00 |^.&.*.(.<.B.a.c.| 00000050 6b 00 75 00 70 00 4c 00 6f 00 67 00 3e 00 3c 00 |k.u.p.L.o.g.>.<.| 00000060 42 00 61 00 63 00 6b 00 75 00 70 00 52 00 65 00 |B.a.c.k.u.p.R.e.| 00000070 73 00 74 00 6f 00 72 00 65 00 53 00 79 00 6e 00 |s.t.o.r.e.S.y.n.| 00000080 63 00 56 00 65 00 72 00 73 00 69 00 6f 00 6e 00 |c.V.e.r.s.i.o.n.| 00000090 3e 00 31 00 35 00 30 00 3c 00 2f 00 42 00 61 00 |>.1.5.0.<./.B.a.|
Because this file has a different filename and is in a different path, using “DataModel” should keep identification specific to a PBIX file.
The Power BI Report has a template option. This format uses the .PBIT extension and doesn’t contain any data only a template to use with other data. The structure is roughly the same, but doesn’t contain the “DataModel” file, but “DataModelSchema”, which appears to be a JSON file.
hexdump -C DataModelSchema | head 00000000 7b 00 0d 00 0a 00 20 00 20 00 22 00 6e 00 61 00 |{..... . .".n.a.| 00000010 6d 00 65 00 22 00 3a 00 20 00 22 00 38 00 36 00 |m.e.".:. .".8.6.| 00000020 65 00 34 00 32 00 62 00 33 00 30 00 2d 00 30 00 |e.4.2.b.3.0.-.0.| 00000030 34 00 34 00 33 00 2d 00 34 00 36 00 30 00 63 00 |4.4.3.-.4.6.0.c.| 00000040 2d 00 61 00 36 00 66 00 36 00 2d 00 36 00 66 00 |-.a.6.f.6.-.6.f.| 00000050 34 00 35 00 35 00 66 00 64 00 64 00 31 00 61 00 |4.5.5.f.d.d.1.a.| 00000060 35 00 36 00 22 00 2c 00 0d 00 0a 00 20 00 20 00 |5.6.".,..... . .| 00000070 22 00 63 00 6f 00 6d 00 70 00 61 00 74 00 69 00 |".c.o.m.p.a.t.i.| 00000080 62 00 69 00 6c 00 69 00 74 00 79 00 4c 00 65 00 |b.i.l.i.t.y.L.e.| 00000090 76 00 65 00 6c 00 22 00 3a 00 20 00 31 00 35 00 |v.e.l.".:. .1.5.|
The DataModelSchema JSON has some plain text strings which could be used for identification. Later in the file there is a string, “defaultPowerBIDataSourceVersion“.
000001c0 20 00 20 00 20 00 7d 00 2c 00 0d 00 0a 00 20 00 | . . .}.,..... .| 000001d0 20 00 20 00 20 00 22 00 64 00 65 00 66 00 61 00 | . . .".d.e.f.a.| 000001e0 75 00 6c 00 74 00 50 00 6f 00 77 00 65 00 72 00 |u.l.t.P.o.w.e.r.| 000001f0 42 00 49 00 44 00 61 00 74 00 61 00 53 00 6f 00 |B.I.D.a.t.a.S.o.| 00000200 75 00 72 00 63 00 65 00 56 00 65 00 72 00 73 00 |u.r.c.e.V.e.r.s.| 00000210 69 00 6f 00 6e 00 22 00 3a 00 20 00 22 00 70 00 |i.o.n.".:. .".p.| 00000220 6f 00 77 00 65 00 72 00 42 00 49 00 5f 00 56 00 |o.w.e.r.B.I._.V.| 00000230 33 00 22 00 2c 00 0d 00 0a 00 20 00 20 00 20 00 |3.".,..... . . .|
Seems like the best identification of the template format.
As usual you can find my signature proposal on my GitHub along with a couple “safe” samples.