Step 4 — Manage Your Data
Data collection is a primary goal of most citizen science and crowdsourcing projects; similar considerations apply to projects that focus on data processing, e.g., classifying image contents. Successful projects must ensure data quality, usefulness and preservation.
The following tips will help you get started:
- Think of your data as an asset.
- Prepare a data management plan.
- Acquire your data.
- Process your data.
- Analyze your data.
- Share your data.
- Preserve your data.
- The OpenPV Project: Crowdsourcing Solar Energy Data
- The National Map Corps: Crowdsourcing Map Data
- The SMAP/GLOBE Partnership: Citizen Scientists Measure Soil Moisture
- CoCoRaHS — Community Collaborative Rain, Hail and Snow Network: Citizen Scientists Track Precipitation
- The Monarch Larva Monitoring Project: Citizen Scientists Monitor Monarch Butterflies
- Project BudBurst: Citizen Scientists Track Seasonal Plant Changes
- The Air Sensor Toolbox: Citizen Scientists Measure Air Quality
- Measuring Broadband America’s FCC Speed Test App for Android and iOS: Crowdsourcing Mobile Broadband Performance
To ensure the usefulness of your data, think of it as an asset with a “data lifecycle” of interlinked phases, including planning, acquisition, processing, analyzing, preserving and sharing. You will need to answer questions related to documentation, storage, quality assurance and ownership for each stage of the data lifecycle. At each stage, consider cross-cutting elements, such as description (including metadata and documentation), quality management, backup and security.
- Before starting your project, understand your data needs. Make sure that the data you collect will help you achieve your overall project goals.
- Make sure that volunteers have the skills or training needed to collect or analyze data with the quality you need.
- Tailor the scope of your data collection to your project needs. Make sure you are collecting data over the right spatial area for the right amount of time. Sampling can be particularly challenging in citizen science due to natural biases, but thoughtful strategies can help avoid problems from oversampling some areas and undersampling others.
- Keep in mind potential legal and ownership issues associated with the data you collect. Figure out what types of data you will share, who owns it and who will have access. Make sure everyone involved in the project understands and agrees.
Planning for data management is closely related to the second “How To” step in this toolkit (Design a Project). Writing a data management plan will help you evaluate what type of data to collect, how to collect it, and what additional resources you will need. A plan is also required under the federal Open Data Policy. You will need to take several general considerations into account before developing a specific data management plan.
- In your plan you should address:
- standards, responsibilities and methods for data collection;
- data description (metadata) and structure (schema);
- data evaluation, quality assurance and quality control; and
- methods for data hosting and preservation, sharing, statistical analysis and getting feedback.
- Consider such questions as:
- What data are you collecting? Do the data already exist?
- Will you need to incorporate data from outside sources to meet your project goals?
- Who is responsible for managing the data and the data management plan?
- How will the data be collected?
- What format will the data and their metadata be in?
- How will the data be checked and certified?
- What are all the likely uses for the data, who will use them, and what kinds of outputs will be needed?
- How will the data be stored and backed up and for how long?
- Obtain any required institutional permissions or approvals. You may need to look at the Paperwork Reduction Act and the Privacy Act, and/or consult with your Human Subjects Board. These approvals may include requirements for data preservation or privacy protection.
- Obtain any needed permissions and waivers from volunteers, such as for potential future uses of the data they’ll be collecting or analyzing (examples coming soon).
- Document your project’s policies, terms and conditions relating to privacy; to participation (such as age restrictions and physical requirements); and to data ownership, access and use. Make these policies available to participants in plain language (examples coming soon).
- Data Management Guide for Public Participation in Scientific Research
- Data Management Planning Tool
- Federal Open Data Policy
- Human Subjects Board (Wikipedia)
- Paperwork Reduction Act of 1995
- Paperwork Reduction Act Fast Track Process (DigitalGov)
- USGS Data Management Checklist (PDF)
- USGS Data Management Plans
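One way to make the plan's "data description (metadata) and structure (schema)" item concrete is to write the schema down as a small data dictionary before collection begins. The sketch below is illustrative only: the `FieldSpec` class, the field names, the units and the allowed values are all hypothetical and describe an imagined rainfall-reporting project, not any project or format named in this toolkit.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class FieldSpec:
    """One entry in a data dictionary. All example values are hypothetical."""
    name: str
    dtype: str                                  # e.g. "float", "str", "datetime"
    description: str
    unit: Optional[str] = None
    required: bool = True
    allowed: Optional[Tuple[str, ...]] = None   # controlled vocabulary, if any


# Hypothetical schema for an imagined rainfall-reporting project.
SCHEMA = (
    FieldSpec("observed_at", "datetime", "Time of observation (ISO 8601, UTC)"),
    FieldSpec("latitude", "float", "WGS84 latitude", unit="decimal degrees"),
    FieldSpec("longitude", "float", "WGS84 longitude", unit="decimal degrees"),
    FieldSpec("rain_mm", "float", "Rainfall in the last 24 hours", unit="mm"),
    FieldSpec("sky", "str", "Sky condition",
              allowed=("clear", "partly cloudy", "overcast")),
    FieldSpec("photo_file", "str", "Optional photo of the gauge", required=False),
)


def data_dictionary(schema):
    """Render the schema as plain-text lines, e.g. for a plan appendix."""
    lines = []
    for spec in schema:
        unit = f" [{spec.unit}]" if spec.unit else ""
        need = "required" if spec.required else "optional"
        line = f"{spec.name} ({spec.dtype}{unit}, {need}): {spec.description}"
        if spec.allowed:
            line += " Allowed values: " + ", ".join(spec.allowed) + "."
        lines.append(line)
    return lines


if __name__ == "__main__":
    print("\n".join(data_dictionary(SCHEMA)))
```

Keeping the schema in one place like this means the same definition can feed the volunteer form, the database and the published documentation, which may help keep them consistent as the project evolves.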
You can acquire new data by collecting them, by adapting old data, by sharing or exchanging data and by purchasing data. In citizen science and crowdsourcing projects that involve data collection, volunteers typically record their empirical observations or use equipment such as cameras to create data. The more accurate your volunteers are in collecting data, the more credibility your project will have and the less work you’ll need to do to filter and clean up data later on.
- Whenever possible, use standardized protocols for data collection to ensure consistency and to help volunteers know what to do when. Test your protocols and questions in a pilot project to check how easy they are to understand, how easy they are to use and the accuracy of results. Expect to make at least two rounds of revisions prior to launching.
- Train your volunteers and give them the information they need to understand the data they are collecting, including easy-to-understand training materials. Consider creating a video; video training can be as effective as in-person training.
- Asking volunteers to take photos can give you a useful way to evaluate recorded observations or classifications and give feedback. The photos themselves can act as data, providing additional information beyond written or numerical responses.
- Be flexible. Consider a range of tools and approaches for collecting the data you need.
- Mobile devices can decrease errors by automatically and consistently collecting data such as time and location; they can also streamline the handling of photos and other sensor data. However, relying solely on mobile devices can limit collection in isolated areas or exclude people who can’t afford them.
- Consider applying a custom taxonomy or other standard, where appropriate, that allows observational data to be entered at varying levels of certainty. Some participants choose not to submit data when they are unsure of their own accuracy; you can either reassure them or offer them a way to indicate how certain they are.
- Consider using many ways to collect data (e.g., paper data sheets), particularly if your project requires participation from isolated communities or a range of socio-economic and age groups. Having options for both non-digital and digital input will allow everyone to participate. For example, participants might record observations with either an app or pen-and-paper questionnaires. If feasible, you can also provide data collection devices to volunteers, either permanently or on loan.
- To keep your database clean and ready to use, follow the standard practices of traditional data collection and data entry—for example, you can frame questions as multiple choice, or only accept responses as numbers within a certain range. This also reduces the likelihood of “spam” submissions and fraudulent data.
- Beach Watch: Citizen Science on Shore Conditions
- Mark2Cure: Crowdsourcing Medical Literature to Find Cures
- National Phenology Network: Metadata for Plant and Animal Phenology Datasets
- Snapshot Serengeti: Crowdsourcing Captures African Species Data (Journal Article, 2015)
- USGS Data Acquisition Methods
- USGS Data Quality Management
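The tips above on multiple-choice questions, numeric ranges and certainty levels can be sketched as a small check run on each submission before it enters the database. This is a minimal illustration under stated assumptions: the field names (`sky`, `rain_mm`, `certainty`), the allowed choices and the plausible range are hypothetical and would need to be set for a real project.

```python
# Hypothetical choices and limits for an imagined rainfall project.
ALLOWED_SKY = ("clear", "partly cloudy", "overcast")
RAIN_MM_RANGE = (0.0, 500.0)
CERTAINTY_LEVELS = ("certain", "probable", "unsure")


def validate_submission(record):
    """Return a list of problems; an empty list means the record passes."""
    problems = []

    # Multiple choice: accept only values from the fixed list.
    if record.get("sky") not in ALLOWED_SKY:
        problems.append("sky must be one of: " + ", ".join(ALLOWED_SKY))

    # Numeric range: the value must parse as a number and fall inside the range.
    try:
        rain = float(record.get("rain_mm"))
    except (TypeError, ValueError):
        problems.append("rain_mm must be a number")
    else:
        low, high = RAIN_MM_RANGE
        if not low <= rain <= high:
            problems.append(f"rain_mm must be between {low} and {high}")

    # Certainty is optional and defaults to "certain" when omitted.
    if record.get("certainty", "certain") not in CERTAINTY_LEVELS:
        problems.append("certainty must be one of: " + ", ".join(CERTAINTY_LEVELS))

    return problems


if __name__ == "__main__":
    print(validate_submission({"sky": "purple", "rain_mm": "-3"}))
```

Returning a list of readable messages, rather than rejecting the record silently, lets a form or app tell the volunteer exactly what to correct.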
Synthesize your data and present it in a meaningful format based on appropriate data standards. Federal policies for open data and open access require that all data acquired for or funded by the federal government be made accessible in standard formats and, if possible, in non-proprietary and machine-readable formats.
- Decide whether you can strengthen your project by collecting data from both technicians/researchers and the public. Structure databases in a parallel format so that multiple data sources can be easily combined.
- Bring in data from alternate sources, such as remotely sensed data or weather information, that can help you check volunteer-collected data.
- Be sure to look for outliers in your dataset, such as very large or small numbers, that might indicate an error.
- When you notice potential errors, check whether they are systematic in some fashion, e.g., a common data entry error that can be easily corrected with a quick email and an edit to the training materials. Document these issues and make adjustments appropriately.
- If the project requires a substantial shift in procedures that will affect comparability of ongoing data, document these changes and the rationale, and notify participants in plain language. If feasible, provide data with both the original values and values adjusted to compensate for changed methods.
- Use best practices for data management. For example, rigorously document your processing methods to ensure the integrity of your data. Include details on data transformations such as merging values into ranges, as well as rules applied for correcting data, detecting false or unacceptable records, and omitting data from public view (e.g., altered location resolution for observations of sensitive species). Ensure that participant privacy is being properly protected in any data that are publicly accessible.
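As one possible way to look for outliers, the sketch below flags values that sit far from the median, using the median absolute deviation (MAD) and the commonly used modified z-score cutoff of 3.5. This rule is an assumption chosen for illustration, not a method prescribed by this toolkit, and a flagged value is only a candidate for review: it may be a genuine extreme observation rather than an error.

```python
from statistics import median


def flag_outliers(values, threshold=3.5):
    """Return the indices of values whose modified z-score exceeds threshold.

    The modified z-score is 0.6745 * |x - median| / MAD, where MAD is the
    median of the absolute deviations from the median.
    """
    values = list(values)
    if not values:
        return []
    center = median(values)
    mad = median([abs(v - center) for v in values])
    if mad == 0:
        # At least half the values are identical, so there is no spread to
        # scale against; leave such datasets for manual review.
        return []
    return [
        i for i, v in enumerate(values)
        if 0.6745 * abs(v - center) / mad > threshold
    ]


if __name__ == "__main__":
    readings = [10, 12, 11, 13, 12, 11, 120]   # made-up example values
    print(flag_outliers(readings))             # index of the 120 reading
```

A median-based rule is used here because a single extreme value distorts the mean and standard deviation, which can hide the very outlier being sought.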
As in any scientific undertaking, analysis helps you document and describe facts, detect patterns, develop explanations, test hypotheses and check for error. Analysis of citizen science or crowdsourcing data isn’t necessarily different from analysis of data collected by other methods, and can vary widely depending on the nature of the study and type of data. Knowing how you’ll analyze data before you create your final collection plan is key. If you are not familiar with how to analyze data for your project, find expert partners who can help you ensure a good match between collection plans, analytic methods, and project goals.
- Measure or account for error. Consider having multiple people make observations so you can estimate the variance between observers. If recognized experts can provide some observations, you can also evaluate differences between traditional and volunteer data collection. When appropriate, samples or vouchers can provide additional means for verification, but requiring a priori evidence may be an unnecessary barrier in some cases.
- Many statistical frameworks require accounting for effort. Identify ways to account for the effort your volunteers put into making their observations, making sure that your ways of accounting for effort are appropriate for your analytical method.
- Some analytics specific to citizen science can quantify the cost savings from using volunteers. Document such data to help evaluate the quality and success of your project. Volunteer hours are one of the most comparable metrics across projects and can be tied to effort reporting.
- Have a non-scientist review materials intended for the public before distributing them.
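Two of the ideas above, estimating variance between observers and accounting for effort, can be sketched in a few lines. The function names and the example numbers are hypothetical, and real analyses will usually need a statistical model suited to the study design; this only shows the basic arithmetic.

```python
from statistics import mean, variance


def observer_spread(counts):
    """Mean and sample variance of counts made by different observers
    at the same site during the same visit."""
    if len(counts) < 2:
        raise ValueError("need counts from at least two observers")
    return mean(counts), variance(counts)


def per_effort(count, hours):
    """Observations per hour of search effort, so that visits of
    different lengths can be compared."""
    if hours <= 0:
        raise ValueError("hours must be positive")
    return count / hours


if __name__ == "__main__":
    # Three volunteers counted the same plot (made-up numbers).
    print(observer_spread([4, 6, 5]))
    # 12 sightings reported after 1.5 hours of searching.
    print(per_effort(12, 1.5))
```

Recording effort (here, hours) alongside every count is what makes the second calculation possible, so it is worth building into the data sheet from the start.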
One goal of citizen science and crowdsourcing is to generate data that meets your organization’s needs for basic research, problem solving, policy making, decision support or education. You should maintain and share your data in a medium that people can find, understand and easily use in a variety of technical and non-technical contexts.
Both raw and processed data will require accurate metadata (descriptions of data); metadata provide essential information about datasets including their ownership, origins, purpose, content, scope and structure; methods of handling and processing the datasets; and legal constraints on their use. This information is critical to ensure that you fully understand your data, can readily evaluate them for quality and suitability, can successfully integrate them with other datasets, and can reuse or defend them if necessary.
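As a rough illustration of the elements listed above, the sketch below checks that a metadata record covers each one and writes it out as JSON to accompany a data file. The key names and example values are hypothetical and do not follow any formal metadata standard; a real project would map them to whatever standard its agency or repository requires.

```python
import json

# One key per element named in the paragraph above (names are hypothetical).
REQUIRED_ELEMENTS = (
    "owner", "origin", "purpose", "content", "scope",
    "structure", "processing", "legal_constraints",
)

# Made-up record for an imagined rainfall project.
EXAMPLE_METADATA = {
    "owner": "Example Watershed Volunteers (hypothetical)",
    "origin": "Volunteer rain-gauge readings submitted by web form",
    "purpose": "Track daily precipitation for local flood planning",
    "content": "One row per gauge reading",
    "scope": "Example County, 2020-01-01 onward",
    "structure": "CSV; columns: observed_at, latitude, longitude, rain_mm",
    "processing": "Range checks at entry; flagged outliers reviewed by staff",
    "legal_constraints": "Locations of private gauges rounded before release",
}


def missing_elements(metadata):
    """List the required elements that are absent or blank."""
    missing = []
    for key in REQUIRED_ELEMENTS:
        value = metadata.get(key)
        if value is None or not str(value).strip():
            missing.append(key)
    return missing


def to_sidecar_json(metadata):
    """Serialize complete metadata as JSON; refuse incomplete records."""
    gaps = missing_elements(metadata)
    if gaps:
        raise ValueError("metadata is missing: " + ", ".join(gaps))
    return json.dumps(metadata, indent=2, sort_keys=True)


if __name__ == "__main__":
    print(to_sidecar_json(EXAMPLE_METADATA))
```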
- To the extent you can, figure out who will need your data or want to see them, whether it’s researchers, journalists, policymakers or a particular community.
- Decide on the most efficient, audience-appropriate and cost-effective ways of giving users the data access they need. Start by providing easy-to-use search and discovery tools.
- Consider how you can present and interpret your results to make them clear and understandable to your volunteers and other audiences. Translate results into plain language, use simple graphs and offer map-based visualizations where appropriate.
- Provide the simplest possible tools or methods for data visualization, evaluation and comparison, summary or abstraction (such as maps or GIS, statistical summaries and charts and graphs), and data download (such as CSV files for custom query results, as well as compressed packages of pre-selected, documented data).
- In sharing your data, know your organization’s standard review, approval and release policies. In particular, make sure to include controls to protect privacy, proprietary or other restricted information, and the integrity of the data itself.
- Make sure data recipients can access complete metadata and other documentation so that they can evaluate, replicate and make the best possible use of your results. Identify the sources, license, methods, and contents of the data.
- Make your data available for public use beyond your own immediate needs, in accordance with federal requirements for open data and open access. Request (or require) that participants share original images under an unrestrictive license, such as CC-BY, that permits redistribution. Organize your data to be searchable. If necessary, restrict access to certain parts (for example, to protect physically or culturally sensitive collection sites or threatened and endangered species).
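One way to restrict access to sensitive locations while still sharing the observations is to publish coarsened coordinates. The sketch below snaps the location of any record for a sensitive species to the center of a grid cell. The species list, field names and the 0.1-degree cell size (roughly 11 km north to south) are assumptions for illustration; the right resolution depends on the species and on your organization's release policy.

```python
import math

SENSITIVE_SPECIES = {"example orchid"}   # hypothetical list


def generalize(value, cell_deg):
    """Snap a coordinate to the center of its enclosing grid cell."""
    return round((math.floor(value / cell_deg) + 0.5) * cell_deg, 6)


def public_view(records, cell_deg=0.1):
    """Return copies of the records that are safe to publish.

    Records for sensitive species get generalized coordinates; every record
    is labeled so users know which locations were altered.
    """
    published = []
    for rec in records:
        public = dict(rec)   # copy, so the original data stay untouched
        sensitive = rec["species"] in SENSITIVE_SPECIES
        if sensitive:
            public["latitude"] = generalize(rec["latitude"], cell_deg)
            public["longitude"] = generalize(rec["longitude"], cell_deg)
        public["location_generalized"] = sensitive
        published.append(public)
    return published


if __name__ == "__main__":
    rows = [
        {"species": "example orchid", "latitude": 44.9778, "longitude": -93.2650},
        {"species": "common dandelion", "latitude": 44.9778, "longitude": -93.2650},
    ]
    for row in public_view(rows):
        print(row)
```

Flagging altered records in the published data, and describing the rule in the metadata, keeps the adjustment documented for anyone who reuses the dataset.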
Plan to preserve your data for the long term, meeting the data retention policies and practices of your agency as well as of the National Archives and Records Administration. You can preserve your data by archiving it or submitting it to an authorized data repository. You should organize and document your datasets well enough for others to understand and reuse them long term. You should also promptly label and replace outdated information.
- Find an authorized data repository for the long-term storage of your data. One example is the U.S. Geological Survey’s ScienceBase, which allows for storage of many different types of data and associated project information. There are also many repositories focused on specific topics and types of data.
- Arrange for the long-term storage of “archival” data — that is, data that remain important for future use but are no longer needed for immediate access.
- Prepare archival data by reviewing its metadata and documentation for accuracy, and making certain that potentially personally identifiable information about participants is properly managed.
- Consider how potential future users will discover that your archived data exist, along with the basics of what they contain. Make sure that your data are listed in catalogs or directories of data of similar types (such as MoveBank for animal movement data) and in the appropriate federal and agency open data catalogs (such as Data.gov and Data.doi.gov).
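As one hedged example of managing personally identifiable information before archiving, the sketch below drops direct identifiers and replaces each participant ID with a keyed pseudonym, so that records from the same volunteer can still be grouped. The field names and the choice of a 16-character HMAC-SHA256 pseudonym are assumptions for illustration. Pseudonymized data are not necessarily anonymous (location and timing can still identify people), so whether this is sufficient depends on your agency's privacy requirements.

```python
import hashlib
import hmac

DIRECT_IDENTIFIERS = ("name", "email")   # hypothetical field names


def pseudonym(participant_id, secret_key):
    """Stable pseudonym for a participant, derived with a secret key.

    The same ID and key always give the same pseudonym; without the key the
    mapping cannot be rebuilt by hashing guessed IDs.
    """
    digest = hmac.new(secret_key, participant_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def prepare_for_archive(records, secret_key):
    """Return copies of the records with direct identifiers removed and
    participant IDs replaced by pseudonyms."""
    archived = []
    for rec in records:
        clean = {k: v for k, v in rec.items() if k not in DIRECT_IDENTIFIERS}
        clean["participant_id"] = pseudonym(str(rec["participant_id"]), secret_key)
        archived.append(clean)
    return archived


if __name__ == "__main__":
    key = b"replace-with-a-secret-kept-outside-the-archive"
    rows = [{"participant_id": "u1", "email": "a@example.org", "rain_mm": 2.0}]
    print(prepare_for_archive(rows, key))
```

The secret key should be stored separately from the archived data; anyone holding both could recompute the pseudonym for a known participant ID.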