OpenAI is asking third-party contractors to upload real assignments and tasks from their current or previous workplaces so it can use the data to evaluate the performance of its next-generation AI models, according to documents from OpenAI and training-data company Handshake AI obtained by WIRED.
The project appears to be part of OpenAI's effort to establish a human baseline for various tasks against which its AI models can then be compared. In September, the company launched a new evaluation process to measure the performance of its AI models against human professionals across industries. OpenAI says this is a key indicator of progress toward AGI, or an AI system that outperforms humans at the most economically valuable tasks.
“We hired people from different professions to help collect real-world tasks, modeled after the tasks you performed in your full-time job, so we can measure how well AI models perform at those tasks,” reads a confidential document from OpenAI. “Take existing pieces of long or complex work (hours or days+) that you have done in your profession and turn each work into a task.”
OpenAI asks contractors to describe tasks they've performed in their current job or in the past and to upload real examples of work they've done, according to an OpenAI presentation on the project viewed by WIRED. Each of the examples must be “a concrete output (not a summary of the file, but the actual file), e.g. a Word document, PDF, Powerpoint, Excel, image, repository,” according to the presentation. OpenAI says people can also share made-up work examples created to show how they would realistically respond in specific scenarios.
OpenAI and Handshake AI declined to comment.
Real-world tasks have two components, according to the OpenAI presentation: the task request (what a person's manager or colleague assigned) and the task deliverable (the actual work the person produced in response to that request). The company emphasizes several times in its instructions that the examples contractors share should reflect the “real, on-the-job work” the person has “actually finished.”
An example in the OpenAI presentation outlines a job for a “Senior Lifestyle Manager at a luxury concierge company for ultra-high net worth individuals.” The goal is to “prepare a short two-page PDF version of a seven-day yacht trip overview to the Bahamas for a family traveling there for the first time.” The example includes additional details about the family's interests and what the itinerary should look like. The “experienced human outcome” then shows what the contractor would upload in this case: a real Bahamas itinerary created for a client.
OpenAI instructs contractors to remove company intellectual property and personally identifiable information from the work files they upload. Under a section titled “Important Reminders,” OpenAI tells contractors to “delete or anonymize anything: personal information, proprietary or confidential data, material non-public information (e.g. internal strategy, unreleased product details).”
One of the files reviewed by WIRED mentions a ChatGPT tool called “Superstar Scrubbing” that provides advice on how to remove confidential information.
Evan Brown, an intellectual property attorney at Neal & McDevitt, tells WIRED that AI labs receiving confidential information from contractors at this scale could face trade secret misappropriation claims. Contractors who hand over documents from previous workplaces to an AI company, even after scrubbing them, risk violating nondisclosure agreements with former employers or exposing trade secrets.
“The AI lab places a lot of trust in its contractors to decide what is and is not confidential,” Brown says. “If they let something through, do the AI labs really take the time to determine what is and is not a trade secret? It seems to me the AI lab is putting itself at great risk.”
