At the end of June, Microsoft released a new kind of artificial intelligence technology that could generate its own computer code.
The tool, called Copilot, is designed to speed up the work of professional programmers. As they typed on their laptops, it suggested ready-made blocks of computer code that they could add directly to their own programs.
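The interaction works roughly like this (a hypothetical sketch of the workflow, not actual Copilot output): the programmer types a comment and a function signature, and the tool proposes a completion for the body.

```python
# Hypothetical illustration of the Copilot workflow, not actual Copilot
# output: the programmer writes the comment and the signature, and the
# tool suggests the function body.

# Check whether a string reads the same forwards and backwards.
def is_palindrome(text: str) -> bool:
    cleaned = text.lower()           # suggested line: normalize case
    return cleaned == cleaned[::-1]  # suggested line: compare with reverse
```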
Many programmers loved the new tool, or were at least intrigued by it. But Matthew Butterick, a programmer, designer, writer and lawyer in Los Angeles, was not one of them. This month, along with a team of other attorneys, he filed a lawsuit seeking class-action status against Microsoft and the other high-profile companies that designed and deployed Copilot.
Like many advanced AI technologies, Copilot developed its skills by analyzing vast amounts of data. In this case, it relied on billions of lines of computer code posted on the Internet. Mr. Butterick, 52, equates this process to piracy, because the system does not acknowledge its debt to existing work. His lawsuit alleges that Microsoft and its partners violated the legal rights of millions of programmers who spent years writing the original code.
The lawsuit is believed to be the first legal attack on a design technique called “AI training,” a way of building artificial intelligence that is poised to reshape the tech industry. In recent years, many artists, writers, experts and privacy activists have complained that companies are training their AI systems with data that does not belong to them.
The lawsuit has echoes of conflicts from the technology industry’s past few decades. In the 1990s and into the 2000s, Microsoft fought the rise of open source software, viewing it as an existential threat to the company’s future. As the importance of open source grew, Microsoft embraced it and even bought GitHub, a home for open source programmers and a place where they built and stored their code.
Almost every new generation of technology — even online search engines — has faced similar legal challenges. Often, “there’s no statute or case law that covers this,” said Bradley J. Hulbert, an intellectual property attorney who specializes in this increasingly important area of law.
The lawsuit is part of a tidal wave of concerns about artificial intelligence. Artists, writers, composers and other creative types are increasingly concerned that companies and researchers are using their work to create new technology without their consent and without compensation. Companies train a wide variety of systems in this way, including art generators, speech recognition systems such as Siri and Alexa, and even self-driving cars.
Copilot is based on technology built by OpenAI, an artificial intelligence lab in San Francisco, backed by a billion dollars in funding from Microsoft. OpenAI is at the forefront of the increasingly widespread effort to train artificial intelligence technologies using digital data.
After Microsoft and GitHub released Copilot, GitHub’s chief executive, Nat Friedman, tweeted that using existing code to train the system was “fair use” of the material under copyright law, an argument often made by the companies and researchers who build these systems. But no court case has yet tested that argument.
“Microsoft and OpenAI’s ambitions go far beyond GitHub and Copilot,” Butterick said in an interview. “They want to train on all data for free, without permission, forever.”
In 2020, OpenAI unveiled a system called GPT-3. Researchers trained the system using vast amounts of digital text, including thousands of books, Wikipedia articles, chat logs, and other data posted on the Internet.
By locating patterns in all that text, the system learned to predict the next word in a sequence. When someone typed a few words into this “large language model,” it could complete the thought with entire paragraphs of text. In this way, the system could write its own Twitter messages, speeches, poems and news articles.
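A drastically simplified sketch of the idea (GPT-3 uses a large neural network trained over tokens, not simple word counts): even a toy “bigram” model can complete text by counting which word tends to follow which in a training corpus and then sampling the likeliest continuations.

```python
# Toy illustration of next-word prediction (not OpenAI's method):
# count which word follows which in a corpus, then extend a prompt
# by repeatedly sampling a likely next word.
import random
from collections import Counter, defaultdict

corpus = (
    "the system learned to predict the next word in a sequence "
    "and the system could complete the thought with entire paragraphs"
).split()

# Count how often each word follows each other word.
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def complete(prompt: str, length: int = 8) -> str:
    words = prompt.split()
    for _ in range(length):
        candidates = follows.get(words[-1])
        if not candidates:
            break  # no continuation seen for this word
        # Sample the next word in proportion to how often it was observed.
        choices, weights = zip(*candidates.items())
        words.append(random.choices(choices, weights=weights)[0])
    return " ".join(words)

print(complete("the system"))
```

Scaled up to billions of parameters and vast training corpora, this same next-word objective produces the paragraph-length completions described above.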
Much to the surprise of the researchers who built the system, it could even write computer programs, having apparently learned from an untold number of programs posted on the Internet.
So OpenAI went a step further and trained a new system, Codex, on a new set of data specifically populated with code. At least some of this code, the lab later said in a research paper describing the technology, came from GitHub, a popular programming service owned and operated by Microsoft.
This new system became the underlying technology for Copilot, which Microsoft distributed to programmers via GitHub. After about a year of testing with a relatively small number of programmers, Copilot rolled out to all coders on GitHub in July.
For now, the code Copilot produces is simple: it can be useful within a larger project but must be massaged, supplemented and vetted, many programmers who have used the technology said. Some programmers find it useful only when they are learning to code or trying to master a new language.
Still, Mr. Butterick worried that Copilot would eventually destroy the global community of programmers who built the code at the heart of most modern technologies. Days after the system’s release, he published a blog post titled, “This copilot is stupid and wants to kill me.”
Mr. Butterick identifies as an open source programmer, part of the community of programmers who openly share their code with the world. Over the past 30 years, open source software has contributed to the rise of most of the technologies consumers use every day, including web browsers, smartphones and mobile apps.
While open source software is designed to be shared freely between programmers and companies, this sharing is governed by licenses designed to ensure that it is used in ways that benefit the wider community of programmers. Mr. Butterick believes that Copilot has violated these licenses and, as it continues to improve, will make open source coders obsolete.
After complaining publicly about the matter for several months, he filed his lawsuit with a handful of other attorneys. The lawsuit is still in its earliest stages and has not yet received class action status from the court.
To the surprise of many legal experts, Mr. Butterick’s lawsuit does not accuse Microsoft, GitHub, and OpenAI of copyright infringement. His lawsuit takes a different tack, arguing that the companies violated GitHub’s terms of service and privacy policy, while also violating a federal law that requires companies to display copyright information when using material.
Mr. Butterick and another attorney behind the lawsuit, Joe Saveri, said the lawsuit could eventually address the copyright issue.
Asked if the company could discuss the lawsuit, a GitHub spokesperson declined, saying in an emailed statement only that the company “has been committed to responsible innovation with Copilot from the start and will continue to develop the product to best serve developers around the world.” Microsoft and OpenAI declined to comment on the lawsuit.
Under existing law, most experts believe, training an AI system on copyrighted material is not necessarily illegal. But it could be if the system ends up producing material that is substantially similar to the data it was trained on.
Some Copilot users have said it generates code that appears identical, or nearly identical, to existing programs, an observation that could become central to the case brought by Mr. Butterick and others.
Pam Samuelson, a professor at the University of California, Berkeley who specializes in intellectual property and its role in modern technology, said legal thinkers and regulators briefly explored these legal issues in the 1980s, before the technology existed. Now, she said, a legal review is needed.
“It’s not a toy problem anymore,” said Dr. Samuelson.