OpenAI says DeepSeek improperly harvested its data

    OpenAI says it is investigating evidence that the Chinese start-up DeepSeek violated its terms of service by harvesting large amounts of data from its AI technologies.

    The San Francisco start-up, which is now valued at $157 billion, said DeepSeek may have used data generated by OpenAI's technologies to teach similar skills to its own systems.

    This process, called distillation, is common in the AI field. But OpenAI's terms of service say that no one may use data generated by its systems to build technologies that compete in the same market.

    “We know that groups in the PRC are actively working on using methods, including what is known as distillation, to replicate advanced U.S. AI models,” said spokeswoman Liz Bourgeois in a statement to The New York Times, referring to the People's Republic of China.

    “We are aware of and are reviewing indications that DeepSeek improperly distilled our models, and will share information as we know more,” she said. “We take aggressive, proactive countermeasures to protect our technology and will continue working closely with the US government to protect the most capable models being built here.”

    DeepSeek did not immediately respond to a request for comment.

    DeepSeek rattled Silicon Valley tech companies and sent American financial markets into a slide earlier this week after releasing AI technologies that matched the performance of anything else on the market.

    The prevailing wisdom was that the most powerful systems could not be built without billions of dollars' worth of specialized computer chips, but DeepSeek said it had created its technologies with far fewer resources.

    Like other AI companies, DeepSeek built its technologies using computer code and data culled from the internet. AI companies lean heavily on a practice called open sourcing, in which they freely share the code underlying their technologies and reuse code shared by others. They see this as a way of accelerating technological development.

    They also need huge amounts of online data to train their AI systems. These systems learn their skills by identifying patterns in text, computer programs, images, sounds and videos. The leading systems learn their skills by analyzing just about all the text on the internet.

    Distillation is often used to train new systems. If a company takes data from another company's technology, the practice can be legally problematic. But it is often permitted for open source technologies.
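    The idea behind distillation can be illustrated with a minimal sketch: a smaller "student" model is trained on outputs generated by a larger "teacher" model rather than on raw human-written data. Everything below is hypothetical and illustrative; the function names and the toy teacher are assumptions, not any company's actual pipeline.

```python
# Illustrative sketch of the data-collection step in distillation.
# A stand-in "teacher" (in practice a large, expensive model, often
# queried through an API) answers prompts, and its answers become the
# training targets a student model would later be trained to imitate.

def teacher_model(prompt: str) -> str:
    # Placeholder for a powerful teacher model's response.
    return f"Answer to: {prompt}"

def build_distillation_dataset(prompts: list[str]) -> list[tuple[str, str]]:
    # Query the teacher once per prompt and record (input, target) pairs.
    return [(p, teacher_model(p)) for p in prompts]

prompts = ["What is AI?", "Define distillation."]
dataset = build_distillation_dataset(prompts)
for prompt, target in dataset:
    print(prompt, "->", target)
```

    The legal question raised in the article concerns exactly this step: whose model is being queried as the teacher, and whether its terms of service allow the outputs to be reused as training data.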

    OpenAI now faces more than a dozen lawsuits accusing it of illegally using copyrighted internet data to train its systems. These include a lawsuit filed by The New York Times against OpenAI and its partner Microsoft.

    The suit argues that millions of articles published by The Times were used to train automated chatbots that now compete with the news outlet as a source of reliable information. Both OpenAI and Microsoft deny the claims.

    A Times report also showed that OpenAI used speech recognition technology to transcribe the audio of YouTube videos, yielding new conversational text that could make an AI system smarter. Some OpenAI employees discussed how such a move could go against YouTube's rules, said three people with knowledge of the conversations.

    An OpenAI team, including the company's president, Greg Brockman, transcribed more than a million hours of YouTube videos, the people said. The texts were then fed into a system called GPT-4, widely considered one of the world's most powerful AI models and the basis for the latest version of the ChatGPT chatbot.