Data analysis tips and help

Material and resources for best practices, learning, and support for doing data analysis.

Author: Omar Silverman

Warning

🚧 This website and most of its contents are often updated or modified. Many documents are at various stages of completion. 🚧

Some general practices for doing data analysis

Data management

  • Save the raw data
  • Ensure that raw data are backed up in more than one location
  • Create the data you wish to see in the world
  • Create analysis-friendly data
  • Record all the steps used to process data
  • Anticipate the need to use multiple tables, and use a unique identifier for every record
  • Submit data to a reputable DOI-issuing repository so that others can access and cite it
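
As a rough illustration of analysis-friendly data split across multiple tables, the sketch below links two small tables through a unique identifier and checks that the identifiers really are unique. The table contents, the column names (participant_id, measurement_id, weight_kg, age) and the helper functions are all hypothetical and chosen only for this example.

```python
import csv
import io

# Hypothetical example data: two tables linked by a unique identifier.
PARTICIPANTS_CSV = """participant_id,age
P001,34
P002,51
"""

MEASUREMENTS_CSV = """measurement_id,participant_id,weight_kg
M001,P001,70.5
M002,P001,71.0
M003,P002,82.3
"""


def read_table(text):
    """Parse CSV text into a list of dictionaries, one per record."""
    return list(csv.DictReader(io.StringIO(text)))


def check_unique(records, key):
    """Raise an error if any value of `key` appears more than once."""
    values = [record[key] for record in records]
    if len(values) != len(set(values)):
        raise ValueError(f"Duplicate values found in column {key!r}")


def join_tables(measurements, participants):
    """Attach each participant's age to their measurements via participant_id."""
    by_id = {p["participant_id"]: p for p in participants}
    return [{**m, "age": by_id[m["participant_id"]]["age"]} for m in measurements]


participants = read_table(PARTICIPANTS_CSV)
measurements = read_table(MEASUREMENTS_CSV)

# Every record in each table must be identifiable on its own.
check_unique(participants, "participant_id")
check_unique(measurements, "measurement_id")

joined = join_tables(measurements, participants)
for row in joined:
    print(row)
```

Keeping the processing in a script like this also records the steps used to get from the raw tables to the joined one.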

Software

  • Place a brief explanatory comment at the start of every program
  • Decompose programs into functions
  • Be ruthless about eliminating duplication
  • Always search for well-maintained software libraries that do what you need
  • Test libraries before relying on them
  • Give functions and variables meaningful names
  • Make dependencies and requirements explicit
  • Do not comment and uncomment sections of code to control a program’s behavior
  • Provide a simple example or test data set
  • Submit code to a reputable DOI-issuing repository
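
Several of these points can be seen together in a small program. The sketch below is only an illustration: the task (averaging a few weights), the function names and the --drop-missing option are all hypothetical. It opens with a brief explanatory comment, splits the work into named functions, ships a tiny test data set, and switches behavior with a command-line option rather than by commenting code in and out.

```python
# Report the mean of a list of weights (in kg).
# Missing values cause an error unless --drop-missing is given.
import argparse

# Tiny built-in test data set; None marks a missing measurement.
EXAMPLE_WEIGHTS_KG = [70.5, None, 82.3]


def drop_missing(values):
    """Return the values with missing (None) entries removed."""
    return [value for value in values if value is not None]


def mean(values):
    """Return the arithmetic mean of a non-empty list of numbers."""
    if any(value is None for value in values):
        raise ValueError("Missing values present; use --drop-missing to ignore them")
    return sum(values) / len(values)


def parse_args(argv):
    """Read options that control the program's behavior."""
    parser = argparse.ArgumentParser(description="Report the mean weight in kg.")
    parser.add_argument(
        "--drop-missing",
        action="store_true",
        help="ignore missing values instead of stopping with an error",
    )
    return parser.parse_args(argv)


def main(argv):
    """Run the analysis with the given list of command-line arguments."""
    args = parse_args(argv)
    weights = EXAMPLE_WEIGHTS_KG
    if args.drop_missing:
        weights = drop_missing(weights)
    return mean(weights)


mean_weight = main(["--drop-missing"])
print(f"Mean weight: {mean_weight:.1f} kg")
```

In a real script the argument list would usually come from the command line (sys.argv[1:]) instead of being written out in the code.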

Collaboration

  • Create an overview of your project
  • Create a shared “to-do” list for the project
  • Decide on communication strategies
  • Make the license explicit
  • Make the project citable

Project organization

  • Put each project in its own directory, which is named after the project
  • Put text documents associated with the project in the “doc” directory
  • Put raw data and metadata in a “data” directory and files generated during cleanup and analysis in a “result” directory
  • Put project source code in the “src” directory
  • Put external scripts or compiled programs in the “bin” directory
  • Name all files to reflect their content or function
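
One possible way to set up this layout is sketched below. The directory names come from the list above; the project name "bird-survey" and the function create_project_skeleton are hypothetical, and the example writes into a temporary directory so it leaves nothing behind.

```python
from pathlib import Path
import tempfile

# Directory names taken from the list above.
SUBDIRECTORIES = ["doc", "data", "result", "src", "bin"]


def create_project_skeleton(parent, project_name):
    """Create a project directory with the standard subdirectories."""
    project_root = Path(parent) / project_name
    for name in SUBDIRECTORIES:
        (project_root / name).mkdir(parents=True, exist_ok=True)
    return project_root


# Demonstration in a temporary location; "bird-survey" is a made-up project name.
with tempfile.TemporaryDirectory() as scratch:
    root = create_project_skeleton(scratch, "bird-survey")
    created = sorted(path.name for path in root.iterdir())
    print(created)
```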

Keeping track of changes

  • Back up (almost) everything created by a human being as soon as it is created
  • Keep changes small
  • Share changes frequently
  • Create, maintain, and use a checklist for saving and sharing changes to the project
  • Store each project in a folder that is mirrored off the researchers’ working machine
  • Add a file called changelog.txt to the project’s “docs” subfolder
  • Copy the entire project whenever a significant change has been made
  • Use a version control system
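
For projects that track changes by hand, a changelog can be kept up to date with a small helper. The sketch below is one possible approach under stated assumptions: the function add_changelog_entry, the entry format (a dated heading followed by a bullet, newest first) and the example messages are hypothetical. Only the file name changelog.txt and the “docs” subfolder come from the list above.

```python
from datetime import date
from pathlib import Path
import tempfile


def add_changelog_entry(project_root, message, today=None):
    """Put a dated entry at the top of docs/changelog.txt, creating it if needed."""
    today = today or date.today()
    changelog = Path(project_root) / "docs" / "changelog.txt"
    changelog.parent.mkdir(parents=True, exist_ok=True)
    previous = changelog.read_text(encoding="utf-8") if changelog.exists() else ""
    entry = f"## {today.isoformat()}\n* {message}\n\n"
    # Newest entries go first so the latest change is easy to find.
    changelog.write_text(entry + previous, encoding="utf-8")
    return changelog


# Demonstration in a temporary directory with made-up messages and dates.
with tempfile.TemporaryDirectory() as project_root:
    add_changelog_entry(project_root, "Imported raw survey data.", date(2024, 3, 1))
    path = add_changelog_entry(project_root, "Fixed unit conversion.", date(2024, 3, 2))
    text = path.read_text(encoding="utf-8")
    print(text)
```

A version control system records this kind of history automatically, so a helper like this is mainly useful when version control is not an option.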

Manuscripts

  • Write manuscripts using online tools with rich formatting, change tracking, and reference management
  • Write a manuscript in a plain text format that permits version control

Figure: Flow chart showing some steps to take to get help with your coding problems.

Further reading:

  • Wilson G, Bryan J, Cranston K, Kitzes J, Nederbragt L, Teal TK (2017) Good enough practices in scientific computing. PLoS Comput Biol 13(6): e1005510. https://doi.org/10.1371/journal.pcbi.1005510
  • Additional material on reproducible, ethical, and collaborative data science is available from The Turing Way: https://the-turing-way.netlify.app/welcome