How to fix Excel’s gene to date conversion
Although Excel can be a powerful tool for biologists, it also converts some gene names into dates. Here‘s an app to solve that issue.
UPDATE (June 2020): There’s now a web app version of Oct4th available at https://oct4th.sandbox.bio, and a command-line tool for larger datasets.
Microsoft Excel is likely one of the most widely used data analysis tools in the field of biology. And with reason. It enables almost anyone, with just a few clicks, to manage their data, quickly generate plots and calculate simple statistics.
In its infinite wisdom, however, Excel also interprets many gene names as dates, as was previously reported elsewhere. For example, the tumor-supressor gene DEC1 becomes December 1st, while the transcription factor OCT4 becomes October 4st:
This is an important issue because if you open a gene expression matrix from an RNA-Seq experiment into Excel, make changes, and save the spreadsheet, the original genes names will be lost, and will likely confuse any downstream analysis that relies on gene names.
Of course, you’d think we could recover the genes by converting the dates back into gene names, e.g. “December 1st” is DEC1 and “March 1st” is simply MARCH1. Unfortunately, there isn’t always a one-to-one relationship between dates and gene names!
Despite being very different genes, both MARCH2 and MARC2 get converted to March 2nd, leaving us with no easy way (except the order) to determine which expression count belongs to which gene.
Detecting Excel’s gene auto-conversions
Luckily, there’s a simple way to detect whether you already have that issue in your current Excel files: Right-click the column containing your gene names, and select Sort → Ascending. If your gene names were converted to dates, the top genes will typically look like “1-Dec”, “1-Mar”, and “4-Oct”.
Fixing the auto-conversion
Unfortunately there is no way to completely turn off the auto-conversion “feature” in the Excel settings. One workaround would be to modify your input files to add a single quote in front of each gene names, which is quite involved.
Alternatively, you can also avoid this issue by never opening CSV/TSV files directly in Excel. Instead, import your data through Excel’s Import Wizard (File → Import). In the last step of the wizard, click on the column containing your gene names, and set the “Column data format” to “Text” instead of “General”. This will disable the auto-conversion for the selected column:
There’s an app for that!
If this all seems too tedious and if you’d rather have an automated solution, I know just the app for you: Oct4th, an app that converts CSV/TSV data files into Excel files that are free of gene name issue.
Oct4th is available as:
- A web application available at https://oct4th.sandbox.bio/.
- A command-line tool available on PyPI that you can install by running
pip install oct4th; the command-line utility is useful if you have a lot of files to convert or if you’re converting large files.
I’d love to hear from you if you find this tool useful in the comments below, and make sure to check out this episode of the OMGenomics show where Maria Nattestad and I discuss this topic in more detail.