AI Copyright Clash: The Future of AI Tools Could Hinge on the Result of The New York Times Legal Battle with OpenAI
This story was produced through MarketScale.
The future of AI tools, content ownership, fair use, business practices, and copyright law could all depend on a legal battle that is shaping up to be one of the most important of the twenty-first century.
In a landmark lawsuit, the New York Times has sued OpenAI and Microsoft, alleging that they infringed its copyright by using its articles to train ChatGPT. The case raises critical questions about fair use in AI. OpenAI argues that using publicly available internet materials for AI training constitutes fair use because the data is transformed into new, original content. The Times contends that ChatGPT's potential to replicate its content and bypass paywalls threatens its business model and market.
Will the outcome of this case set a significant precedent for the AI industry, potentially reshaping how AI companies create future AI tools and access and use copyrighted material and intellectual property? Legal experts and commentators suggest that this lawsuit may prompt a reevaluation of copyright laws and the development of new legal frameworks to adapt to the evolving AI landscape.
Host Daniel Litwin led a dynamic discussion on MarketScale’s Experts Talk, featuring Igor Jablokov, the CEO of Pryon; Tatiana Rice, the Senior Counsel at the Future of Privacy Forum; Mark Beccue, the AI Research Director at Futurum Group; and Lauren Maffeo, the author of “Designing Data Governance from the Ground Up.” The panel discussed the complexities of the lawsuit and its implications for AI development and content creation.
Highlights from this panel discussion include the following:
- The evolving nature of data, especially unstructured data like tweets and videos, and its use in training large language models (LLMs)
- The legal nuances of copyright law, fair use, and the potential impact on societal norms and business practices
- How the result of this lawsuit could influence the future of AI tools, content ownership, and the balance between innovation and copyright protection
Igor Jablokov is a recognized innovator in AI with a rich history in pioneering voice recognition technologies. He previously founded the AI industry pioneer Yap, whose inventions were later built upon to develop Alexa, Echo, and Fire TV. He is known for his role in advancing AI applications in practical settings, bringing a wealth of experience from his time in leading tech companies. His expertise lies in integrating AI into everyday technology, making it more accessible and user-friendly.
Tatiana Rice brings a unique blend of legal expertise and a deep understanding of privacy issues in the digital age. Her work navigates the complex intersection of technology, privacy, and law. She is adept at analyzing the implications of emerging technologies on privacy and data protection, making her a valuable voice in discussions about the legal challenges posed by AI.
Mark Beccue is a seasoned researcher in the AI field, focusing on the impact of AI on market trends and consumer behavior. His research provides insights into how AI is reshaping industries and consumer experiences. His analytical skills and deep understanding of AI technologies make him a key contributor to discussions about the future direction of AI in business.
Lauren Maffeo’s background in journalism and her expertise in data governance provide a unique perspective on AI’s ethical and societal implications. She advocates for responsible AI practices and data governance, emphasizing the need for standards and accountability in the tech industry. Her work often explores the balance between technological innovation and ethical considerations.
Article by James Kent
Video Transcript
Hello, everyone, and welcome to another episode of Experts Talk, MarketScale's premium debate and discussion roundtable, where we sit down with the top voices in your industry to figure out the largest trends, topics, tech, and timely news that are defining the market movers of your industry. I'm your host, Daniel Litwin, the voice of B2B. Folks, thanks so much for joining us. It's a chilly one today here in Dallas, a bit icy. You know how it gets when it even gets a little cold. No one knows how to drive in Texas. But guess what? Even if it's a little icy here in Dallas, our debate is a hot one today. You're not gonna wanna miss today's discussion. Before we get into it, make sure you head to marketscale.com for previous episodes of Experts Talk. We've had discussions in the food and beverage industry, hospitality industry, business services, and more. You can find those all on our channels page on marketscale.com, and make sure you follow along as we're gonna be going live almost every day for the rest of the year with great debates and great discussion.
Alright, folks. Let's jump into it. Today's Experts Talk is gonna be on a big one. We're opening up debate and discussion on the New York Times versus OpenAI copyright question. Who is gonna win this AI copyright battle? Well, let's get into it, folks. Generative AI was probably the defining and trending technology of 2023, at least the one with the most buzz and the most headlines. There seems to be one every year, and generative AI tools took the cake. ChatGPT, DALL-E, Midjourney, those are just the biggest ones. But several more caught the attention of consumers and businesses and became mainstream in their application, in their thought pieces, in their op-eds. Right? And the larger debate around how much impact is this gonna have on our day to day, and what kind of risky, you know, new layers does this bring to our tech ecosystem?
What opportunities does it bring to businesses, to how we conduct our day as consumers, as people? Right? But more importantly, this all came with some major friction as well among content creators, content houses, LLMs, right, the foundations behind these generative AI tools, and the companies designing them. And now this has led to some legal trouble, and we're seeing that the New York Times has now officially filed a lawsuit against OpenAI and Microsoft, alleging copyright infringement by using their articles to train and test ChatGPT. So the big question that we're gonna be unpacking, and getting a little more context, nuance, and hopefully some what's-next on from our panel today, is what is in store for not only generative AI tools, but also the content feeding them. Right? Who owns the content? Does copyright law as we know it still stand in the face of something at such a large scale iterating upon and reinterpreting and reusing content? Right? Who deserves to get paid? Content creators? Content houses? No one? Everyone? We're gonna be picking it all apart here today with our diverse panel of guests. So let's go ahead and jump into it and let the experts do the talking today on Experts Talk. I'm pleased to welcome our panelists for the discussion. Let's go ahead and go down the line and give everyone a little hello. We're joined first by Igor Jablokov. He's CEO and founder of Pryon. Igor, great to have you on. How are you?
Great being here. Thank you so much.
Thank you for joining us, Igor. We're also joined by Tatiana Rice. She's senior counsel at the Future of Privacy Forum. How are you?
Hey, everybody. Thank you. Doing well.
Lovely. We're also joined today by Mark Beccue. He's AI research director at the Futurum Group. Mark, welcome.
Hi, Daniel. How are you doing?
Doing great. Doing great, man. Thanks for joining us. Last but not least, we're joined by Lauren Maffeo. She's author of Designing Data Governance from the Ground Up.
Lauren, great to have you on as well. How are you?
I'm doing well, Daniel. Thanks for having me.
Absolutely. So folks, we've got a wide variety of perspectives on the panel today. Each of our panelists is coming from their own slice of the market, but everyone definitely has a take on how this is going to impact the future of the industry. So I think it's time to just get into it. Before we get into the nitty gritty of it, panelists, I wanna get y'all's thoughts on the potential, I guess, weight of this discussion at all. Before we really get into the weeds, how much weight should we be putting behind it? Right? How important is this case for the development of generative AI tools and the adoption of generative AI kind of at large in the consumer and business mainstream in 2024? And really, do you foresee this being a consequential case? Or is this debate something that's, in the grand scheme of things, still more of a peripheral discussion as the industry continues to evolve and develop? I'll open it up there.
Something that I think this case reinforces is the idea of data and what it is, because data is something that we're used to seeing as numbers, as being inherently structured, as having very clean metadata attached to it. And this case against OpenAI really reinforces that the vast majority of data we're dealing with today is unstructured. It is tweets, video, audio, and these things are being used to train large language models at a scale that was previously incomprehensible. AI is not new. It has been around for many decades now, but the volume of data, both unstructured and structured as well as semi-structured, that is new. And so I think the biggest thing that sticks out to me is this concept that art is data. It is used as training data. And even though we may not think of it as such in the public purview, this is a very new frontier for every industry and sector.
And so then when we get into copyright, I'm not a lawyer, so I will definitely leave that to Tatiana, but the thing that sticks out with me above all else here is this philosophical discussion of what is data, and then the practicality of using what we perceive of as art to be a source of data for these large language models.
Yeah. I agree completely with what Lauren said, and I would say that this is really the tip of the iceberg here. As we're talking about data, there are so many different forms of it and so many ways that these types of data are gonna get implicated by laws. Right? So we saw a couple years ago, LinkedIn was also suing regarding whether they owned the data on their platform for scraping. We have privacy laws regarding personal data that is scraped for these models. And now, of course, here we're dealing with copyrights. So this is just the tip of the iceberg for what is going to continue to be a legal battle likely to go on for decades.
Yeah. So I would say I agree in a sense. I'd actually throw you a different one. This is a can of worms on multiple levels for AI. Igor might have some thoughts on this, but it's not just the training data that's an issue, it's the outputs. And there have been some legal issues around those outputs. So if you think of text generation or image generation, creators are working to be protected in that sense, and it goes all the way from art to, there's a suit out there against Microsoft and OpenAI for code development. So you see how this is layered in a lot of ways where who owns what. It's not just how you train something to put something out. It's actually the thing that gets created that is gonna be a lawsuit issue as well.
Mark, you're exactly right. It's not just the training data that's coming in there, but it's also the weights and biases that are an intermediate product, and the actual output. Right?
If you're creating derivatives of people's original copyrighted works, if you're stripping out attribution to the original works. Look. This feels like the Uber playbook all over again, where they knew that they were doing something that was inherently illegal in many municipalities, but they decided to ask for forgiveness rather than permission because they knew that they would never get permission. As I intersect with a lot of publishers, I basically tell them this. Look. If somebody's showing up with a Publishers Clearing House-sized check now to acquire your content, just know that means they believe that they're gonna derive ten to one hundred x more value from your own content than that piece of paper that they're showing up with. And secondly, they have really, you know, smart cookies, if you will, working in their legal department, and they're gonna be writing all sorts of scenarios in there, right, to give themselves freedom of action from those derivatives, so that even if, at a certain point in time, the license ends and they press the button, and the source files, let's say, from the New York Times or otherwise, delete themselves, there's still other things that people don't understand from an AI, you know, technologies perspective that remain there that are still valuable.
So let's go ahead and open up the argument a little bit more here. We're gonna frame up kind of both sides of this legal debate. We are gonna pick into a few more nuances too. And I do also wanna let our folks in chat know: put your questions in chat. We will be doing a short Q&A at the very end of our roundtable today to get your Qs in front of our experts. So type away, get your thoughts in there. Okay. So let's go ahead and frame up these two sides, right, of this argument. First one is the New York Times side of things. So the New York Times, again, filed their lawsuit against OpenAI and Microsoft.
They're alleging copyright infringement by using their articles to train and test ChatGPT to build out their LLM. Let's start there. How do we answer this copyright question? Right? Because this seems to be really at the crux of it. Who owns the [...] and by the end user. And is there any legal precedent from some of the previous friction-filled eras of media and, excuse me, media redistribution, content evolution, that we can learn from or that can shed some light on this situation? Again, how do we even approach this copyright question?
Yeah. As a lawyer here, I guess I'll get started. Yes. So the purpose of copyright law, I think as a lot of folks know, is all about economic incentives and social progress. So the goal is really to be progressing science and the arts by securing these exclusive rights to authors and inventors. So as Mark and Igor kinda talked about, there are two main kinda copyright claims and issues here. One is OpenAI's general collection of these kind of copyrighted articles for the purpose of training their LLM, and then the second is the kind of distribution of that information, whether that's a derivative work. So on the first question, the New York Times is arguing that their work is copyrighted, which I think is pretty undisputed. But the question is more of to what extent is it copyrighted. Right? So copyright protects these original works. It protects expression, but it does not protect ideas, facts, methods, things like that. And so the New York Times is saying the general collection of their information that is copyrighted, including their artistic expression and journalistic kind of methods, is copyrighted work. The counterpoint by OpenAI on that point is that it's all publicly available. Even if it's behind kind of a paywall or a subscription model, oftentimes that information ends up getting publicly distributed on the Internet anyways by third party websites, by Twitter, by social media, etcetera.
So it opens up what is called the fair use doctrine or defense. So even if something is copyrighted, there are instances where the law allows kind of unauthorized use of copyrighted works, because it's considered beneficial to society. So we're kinda asking the question, what is the purpose and the character of this kinda challenged use? What is the nature of the work? What is the substantiality of the work, and what is the effect? So the New York Times argues that this is completely anticompetitive to them. This is work that they are already doing. They put a lot of work, effort, investment, going into war zones, things like that, to have this kind of content and expression that goes through a rigorous process. And LLMs shouldn't be able to just freely use that information to distribute to their users. Of course, OpenAI's counterpoint is that it's not anticompetitive, that they're just taking what is publicly available, and users are the ones who are dictating how they're using the product. So I will stop there and let other people jump in, but there's a lot more nuance to all of this.
Let me take a shot here, Tatiana. I think it's really interesting that they are arguing this when their dataset is proprietary. So Igor might be able to shed a little more light on this, but if you ask OpenAI, you say, well, what is the source of your data? Right? They won't tell you. If you're following the industry, we're getting into what are called open source large language models. The most famous ones are from Meta, and they're the Llama family. And even those are not. I don't know if Igor's got a thought on that, but I asked them. I said, are you gonna show us the source of your data? They said, no. So think about that for a minute in terms of how they're approaching this. They're saying, yeah, yeah.
It's all public information, but we're not gonna tell you how we source all this data. So I think that's kind of an interesting idea, and I would also say there are crosswinds and headwinds in this market a little bit because of the way the models are morphing. I mean, mutating so quickly. We're moving from super, super large models to much smaller things that are based on different datasets. And you have some that are based on data that you can trace. So that's gonna be part of this conversation as well that kinda enters into the whole mix of what's gonna happen.
So if you meet a guy on the street and he says, hey, you know, I have a drug for you, and it's gonna give you a good time. And then you ask a follow-up question. Great. Where did it come from? And, you know, is there fentanyl in it or things of that sort? And they're like, trust me. Trust me. Just take it, and they never reveal that. Of course, you start becoming suspicious, I mean, in the first place. I think, again, building on top of what Tatiana said, the publishers had a detente with Google. Right? They would take snippets, put it in their index, but at least the audience would come to them. Now that attribution is stripped away, and you're kept within the boundaries of the ChatGPTs, the Bings, the Bards, you know, the Copilots, whatever these things wanna brand themselves as, the Groks, so on and so forth. And it essentially breaks that model where at least you get an audience, and then you can figure out how to monetize that audience and support your respective business. Now, look, just because you can see something doesn't mean you own it. Right? So people can drive past my house and see me mowing the lawn. That doesn't mean they can walk into my garage and cart the lawnmower away. Right?
So I was planning on doing that after this, so I guess I gotta retool my plans.
But I wanna hear from Lauren on this one too, but let me throw in a little context here too. As a journalist myself, it really seems like, you know, one of the major quirks of this debate is the attribution side of things. When I think of how, you know, content gets reused by reporters, so if the New York Times breaks a story, the local NBC or Fox station, they're probably going to reference that, maybe get another perspective from someone in their community, post that on their site, and then they'll backlink to the original reporting. The facts that were uncovered in, let's say, that war zone, or in that city council meeting, or, you know, in Congress, aren't now just proprietary to the New York Times, where no one else can comment on those facts. But again, it's attribution. So as we get into, essentially, how do we govern the attribution of this data across ChatGPT? How do we even begin to answer that question, especially with that journalism angle in mind?
Well, Daniel, I started my career as a journalist, covering the European tech sector from London, and you're touching on something really essential for this conversation, which is industry standards and best practices. If you look at journalism, if you work for the New York Times or MarketScale, there are not just newsroom and publisher standards you are expected to uphold, but there are industry standards and best practices for being a responsible reporter, which you are expected to uphold. That, I would argue, is why the current media landscape is so problematic: you have all of these one-off creators who are producing content masquerading as news, but it is not beholden to those industry best practices and standards, which journalists are taught to abide by and uphold when they enter the profession. You look at law. You look at medicine, accounting. Any of these fields have very specific governing bodies and standards, which everyone in the profession agrees to abide by.
There is nothing of that sort in the data and tech landscape. On the contrary, there is a lot of pushback towards that because it's argued to be a barrier to innovation, when in actuality what it perpetuates is exactly the types of challenges we're talking about now. It means that products are launched to market, and they have really incredibly negative consequences for various users. As just one of many examples, facial recognition software is an area with well-documented consequences for various people who, by the way, have no say in whether they are beholden to those products or not. So it's not even like you can use the excuse, well, you signed away your rights when you signed these terms and conditions. These AI products are used at a mass scale that we can't comprehend, and they're used on people whether they consent or not. And so that lack of consensus around what responsible AI looks like, that lack of quality standards about which data to use, I think that cuts to the heart of the problem, which is that when it comes to data and AI and standards for what good looks like, for what responsible looks like, we're still in the Wild West. And we're seeing now all of these piecemeal advisory standards come out, but there isn't that one, you know, uniform body, and it's really hurting us, as folks have already said. And I do think that this is just the tip of the iceberg there.
Any other thoughts on this one, on that attribution angle and some of the best practices kind of pouring out of the journalism world?
Yeah. Tatiana might have a thought on this. On what Lauren said, there's a piece to this that I think is gonna be more tip of the iceberg kind of stuff, which is, most people are starting to understand that LLMs are brittle. There's a lot of wonderful wow stuff about it, but they do a lot of things badly. Like, they're biased because they're based on the data that they're trained on. Right?
So you have biased data. You have misinformation. You have disinformation. So no offense to the New York Times. It's a great publication, very legit, but the models are trained on a whole bunch of stuff, which Lauren said originally, which was like tweets and garbage. So, you know, it's garbage in, garbage out. When you go to the outputs, would the content publishers start to say, you used our stuff and then created garbage, and so we're gonna sue you for, you know, misrepresenting us, even if they had permission? So now you got another big mess that could potentially be an issue, like, well, you're picking up pieces of what we said, but not the whole thing. So I think that might be an issue as well. I don't know what Tatiana thinks about it.
Yeah. No. I think you're right, especially about the kind of garbage in, garbage out. So in my work, I work oftentimes with regulators and policy makers around how the law applies to technology and how it applies to AI. And oftentimes, what we say is that the sentiment is, like, technically, AI should not be the Wild West. Right? Anything that an AI is doing that is already regulated by the law should still equally apply. The ambiguity is in how do you apply it. Right? And so issues of bias, a lot of places like the Federal Trade Commission and a lot of federal legislatures are really focused on this issue of disinformation and bias and discrimination and how these systems can perpetuate kinda what we already see as systemic inequalities in society, and how that can be applied to civil rights laws, or where are the gaps. And so, again, going back to, like, the data use and the data attribution, it's, you know, who owns it? Is it an individual? Is it a personal privacy concern? Is it an information concern? Is it a copyright concern, as we see in this case? And then, you know, again, what is the use case here?
How is it being used? Is it derivative? Is it transformative enough? So, like, in this New York Times versus OpenAI case, the question is whether it's transformative enough to be able to survive this, like, kind of fair use doctrine question.
So I wanna pose another big question here. This one, you know, there's so many layers to this, right, that need to be unpacked. There's obviously the journalistic best practices and standards coming down from the New York Times that are coloring the debate. There's precedent around copyright law. There's the fair use discussion. But there's also the big question around just sort of building guardrails, and actually structuring out a standard approach to data capture, data use, and perhaps compensation for said data, at different points across this, you know, ecosystem of use in an LLM's, you know, training life cycle, or the actual sort of long term use of the tool itself. So we're going to hear here for a second from a panelist who unfortunately couldn't join us, but her name is Dr. Joanna Massey. She's a board director and a corporate communications executive. So she's been, you know, in and around these larger copyright conversations for a while. And so I asked her essentially, do generative AI tools have to pay their sources for continuous real-time updating of their LLMs? Should they be paid at all? Should there be an upfront cost? So let's go ahead and hear from her on what she thinks that nuance should look like, and then I want to open up y'all's thoughts to agree, disagree. What do you take away from it? So again, let's go ahead and hear from Dr. Joanna Massey, board director and corporate communications executive.
Back in the 1990s when the Internet first became accessible to the public, all of the media outlets, the newspapers, the TV, put their content online for free. Only one newspaper didn't. That was The Wall Street Journal.
They put everything behind a paywall. And at the time, everyone said, oh, dear. Oh, why are they doing that? Nobody's gonna pay for it online. They're gonna lose readers. Didn't happen. And fast forward, all of a sudden, all of the other media outlets, newspapers, magazines are scrambling to retrain the viewer, the reader, that, no, actually, I'm sorry, you do have to pay for our content online. So they aren't going to make that mistake a second time, which is why we're starting to see lawsuits. I think the issue with the lawsuits is not that they're happening. I think the question is why did they take so long to happen? This should have come out. In my opinion, they could have sued immediately. This is very clearly a case where these large language models, like ChatGPT, like Bard, had been trained and educated on the media's content, a treasure trove of information that it just went in and freely grabbed. So for me, the question is not should they be paid. The question is how much should they be paid. And I think there's two levels to that payment in my estimation. One of them is how much should they be paid for the information that was used to initially train the LLMs? And then the second question is, how much should they continue to be paid on a regular basis to allow the LLMs access into their databases to continue to be trained and to continue to be able to give out information.
Alright. So, again, kind of the crux of the argument there is not should they be paid, but how much and when in that larger life cycle. So I want to open it up for discussion here. Any agreements, disagreements, as I get the wheels turning for any of y'all on this question of attribution, payment for said attribution, and usage of data to inform LLMs?
Yeah. Well, my initial thought is that this feels like a very American conversation, because if we look at Europe and GDPR legislation, that legislation attributes many more rights over one's personal data to the individual citizen. Europe also has legislation called the right to be forgotten, which allows people to talk to search engines like Google and request that certain search results about themselves be removed, and they have the right to do that. They also have the right under GDPR legislation to query any business that has information on them to ask how that data, which the business has collected, is being used. And if the business cannot come up with a suitable response, then under GDPR, that citizen has the right to sue them. And I believe that businesses can be charged up to six percent of their annual revenue in costs if they cannot prove data lineage to give a substantial response for how they are using one's personal data. But, again, that is in Europe. And even though that legislation applies to any company, including American companies, which collect data on European citizens, we, as Americans, do not have really any federal rights to our own personal data. So this is something that I think Tatiana as a legal expert can speak about more. But the concept of data and information as free-flowing currencies is, I think, a uniquely American perspective when we look at the global landscape and especially Europe.
Yeah. Daniel, real quick. Sorry. Two things real fast. First of all, let's not paint these publishers as innocents in this. They're all negotiating. So if you notice, the story with the New York Times is the negotiations broke down. So they were in the process of negotiating a deal. Axel Springer is a publisher that's already cut deals. There's a few others that have. So they're looking for their money.
I will give you one other thing that I think is gonna be interesting, and it's digital watermarking. Like, there's an industry group. One's called the Content Authority Initiative... Content Authentication Initiative, excuse me. So it's for images, video, and there's a new paper that was just written and published by some Chinese researchers about text watermarking. So here's where I'm going with this. The publishers are gonna be able to defend themselves. So they might get the chance to say no, or you pay me.
So I wanna go back to Lauren's point about the data attribution. And I think the issue is less about the publisher, the New York Times, and more about the precedent that it sets on a societal level. Remembering that, like, copyright is all about economic and societal incentives. So it's basically saying that, you know, all data out there on the Internet is free for everybody to use all the time. And I think that does set a very dangerous precedent, not only for copyright reasons, but for privacy reasons, for companies that hold this data, etcetera. My biggest hesitation against a licensing scheme, though, is the kind of market capture that could happen if only the largest companies are able to pay for this kind of content. It really excludes a lot of other companies that are smaller, and it excludes startups and all these other companies from being able to train these same kind of models with the same level of accuracy.
Yes. I agree with that. And that's actually a common argument against this type of reward scheme: the legislation is often cited as one that would hurt small and midsize businesses the most, because they already cannot compete with the top five US tech companies, Microsoft, Google, Apple, etcetera. And so those companies, if they do break the law or skirt the law, they have the resources to pay those fines. We see this already.
Meta gets sued constantly, and they're able to just dole out the money because they earn so much. The average small and midsize business cannot afford to pay six percent of its annual revenue for not having proper data lineage tracking on citizens. And so this legislation, it is worth noting, while I think there's certainly value in it, does have to be structured in a way that it does not indirectly create antitrust problems or worsen existing ones, because it could have a real detrimental effect on competition in the market, and it could really hurt innovation.
I'd have to disagree a little bit, because you're ignoring the open source options that are available right now. There's a suit right now by the EU against OpenAI and Microsoft about collusion or, you know, getting together. Put that aside. What we're assuming here is that this is the only large language model available. That's not true. And the models are changing daily. So what's available today to almost anyone is much cheaper open source models, which, as a matter of fact, might be what the prevailing wind will push into the marketplace for language models. So that has to come into play. But to me, you're pushing that down the road a little bit. We're talking about an issue right now that has to do with large language models.
Yeah. But, Mark, one of the things that she mentioned is the ability for these SMBs and mid-market companies to compete. Right? A lot of these open source models are fake open source. If you actually read the terms and conditions for these Llama models, they don't allow them to be used for critical infrastructure. So a commercial entity can't even try to support a hospital system, utilities, water treatment plants, and things of that sort.
But they're getting this goodwill where everybody's pointing at them as an example of a freely usable model, but it's not. There are many of these models where, when you actually read their T's and C's, they do permit use in research-style situations, but they don't allow you to take them forward and commercialize them. So from her standpoint, it's the inability of new entrants to join this market and have some sort of commercial success that's going to be problematic. Why do you think all of these, essentially, AI oligarchs showed up in front of Congress, in front of the Senate? Because they're gonna do exactly what the banking sector and many other regulated industries did. They're gonna create this glide slope where they're gonna have just enough regulation in order to generate that moat, but enough freedom of action so that they can continue surviving and thriving in their respective businesses. And that's where we're gonna have to be careful, which is why I think we're hopeful when we look at the EO: they're at least aware that that danger exists with respect to up-and-coming entities, in order to have a thriving and free market. And, you know, the jury's out on how they'll settle down.
Right. But please note that the suit is against OpenAI and somebody with money, which is Microsoft. So OpenAI, if it's a standalone company, I mean, I've looked at these guys for a while. They are not making a whole lot of money yet. So it's like, well, okay. This market, and LLMs in general right now, if you look at the others, there are some, what do they call them, unicorns out there, like Anthropic and Cohere, that are private and modeled after OpenAI. None of these people are making money.
Now, that might change over time, but it's interesting to me that the New York Times brought Microsoft along, who obviously has quite a bit of money, and OpenAI does not.
Still, these SMBs and mid-market companies have to follow proper governance. Right? Many of them have boards installed. And so whatever precedent gets created, you have to do that. Otherwise, you're willfully negligent.
Alright. We're gonna move into the end of our roundtable here. The four of y'all have been great so far. Thank you so much for the back and forth and really unpacking this question, because it is still that. It's a big question, and so there aren't that many answers yet. But with that in mind, it is a content roundtable, so I do want to try to plant a few flags here. I know it's hard to really say in which direction this will go, but do you have a sense for who might have the stronger argument in this case? Or maybe the better question is, what do you think may define the strength of the New York Times' versus OpenAI and Microsoft's argument as this debate continues to go through the legal system? Let's go ahead and get a lightning round from each of y'all on this, then we'll go to some Q&A and wrap things up. So again, does anyone have a stronger argument on either side of this aisle? Thoughts?
Yeah. Oh, no. I am not a copyright lawyer. Everything that I've read is just generally from a legal perspective, so I actually don't have strong thoughts about which way this will go. But what I do think is that the New York Times' strongest argument is on the derivative use of their content. They have a couple of instances cited in their complaint where ChatGPT specifically regurgitates exact blocks of text from their articles, and I do think that is the strongest case that they're gonna be able to make, rather than the collection of the data, despite my own personal beliefs.
Yep.
And as someone who is also not a copyright lawyer, something that stuck out to me from OpenAI's response is a blog post they published, which said, training AI models using publicly available Internet materials is fair use, as supported by long-standing and widely accepted precedents. So if we're talking about precedent, that's not necessarily even a question of what is legal, although it could be, and that's heavily implied. But, again, it goes back to what has been done in the past. And what has been done in the past as precedent does not mean that whatever has been done is right or wrong. But it is interesting to me that OpenAI's argument here is that, well, we're just doing what has already been done before in many different circumstances, and it's fair by virtue of being in the public eye. And the question is, is that precedent going to be enough to carry us into the future? The reality is that even if you argue that the current legal landscape is not enough to protect publications like the New York Times, which is a fair argument in and of itself, how long is it going to take the legal profession to create and, you know, execute new laws that do protect publishers and citizens from cases like this? We know that that is going to be an enormously long runway in comparison to innovation in this field, which can seem to jump decades ahead in a few months' time. And so it's interesting to me that OpenAI would cite long-standing and widely accepted precedent as a legal shield for themselves.
Mhmm. Yeah. And if I might jump in really quickly on Lauren's point there, precedent is very fact specific. Right? So one case that keeps coming up is this Google case about how Google Books has indexed a bunch of copyrighted works within their platform, and that was found by the court to be fair use.
The New York Times is gonna be able to easily distinguish that and be like, that actually benefited the authors of those books by being able to show those previews to people, and it was not anticompetitive. In fact, it helped the authors get a larger base. Whereas in this case, it's gonna be harder for OpenAI to make that case and say that their platform is going to be able to aid content creators more.
I think OpenAI is gonna settle this, and it's gonna go in the favor of the New York Times; they're gonna get what they want out of this. But it's not gonna go to court. It's gonna settle.
Yeah. So, members of my team helped create Alexa, Siri, Watson, and things of that sort. So what they're talking about in terms of generally accepted practices is wrong. It's sort of going back to the previous depiction: a person can claim, hey, I've seen other people steal lawnmowers, that's why I'm allowed to steal lawnmowers. But when we were building language models for these things that you know and love as consumer AI brands, we properly licensed content. We had our own proprietary content. We properly sourced open source assets, and we created synthetic data. We did not trip over ingesting copyrighted content. That was part of, essentially, the rule chain that we operated in to be good stewards and good citizens in developing our versions of responsible AIs. So pointing to some sort of fictitious "hey, other people are doing it" is, you know, where are these people? Because all the folks that I know who were legacy practitioners in the AI field were not doing that.
Alright. Yeah. And that synthetic data point, Igor, is really important. That is common practice in the data space.
If you're a data architect, a data scientist, or an engineer, and you are working with what's called personally identifiable information, also known as PII, you are expected to create synthetic data for your training to ensure that you are using data best practices while keeping users' personal information private. So you create synthetic data to mimic what you believe the model will encounter once deployed, so that it knows how to retrain itself. It knows how to create accurate predictions, which also keeps data safe. And there is actually established precedent for addressing these very types of issues. And so the fact that OpenAI can't point to having done that, despite all of its wealth as an organization, is interesting, to say the least.
Alright, y'all. We're gonna go ahead and wrap things up with one last question here from the audience. So this is coming from Ian. Ian has questions around how this larger case, and maybe some of the precedent it will set, might impact hobbyists and smaller players that have an affinity for generative AI and building their own LLMs. So he says computer enthusiasts can run their own local LLMs using freely available datasets like Llama uses. Could hobbyists be open to suits for stolen data? So that's Ian's question. Anyone have an answer there for him?
I think it's a great question, and I think it goes back to what is the source. If you're using open source models, if you're using open source data, then I don't, on the face of it, see an inherent issue with that. In that regard, it reminds me very much of the open source landscape in general as it pertains to code and how you can collaborate across projects and time zones to create new innovations. I will say that the open data landscape is still very much growing in relation to its code counterpart.
And so there might not be as much readily available on the data front as there is code, because data brings a lot more inherent challenges with it than software code does. But I would say when it comes to creators and hobbyists being vulnerable, it really goes back to the source of the data and what they're doing. Now, LLMs are often created using data scraping, which is where you take large swaths of data off the Internet and don't really distinguish its source. So that is something that I do think would leave hobbyists vulnerable, depending on what they do with the model. But, inherently, when it comes to experimentation, if you're talking about open source and staying within that ecosystem, I don't see anything wrong with that in particular. But, again, it goes back to the source. If they start using proprietary information from publishers, from organizations, then they could see some legal issues.
Yeah. I'm actually also gonna speak candidly that, in most cases, it's actually probably not gonna matter. And that's because nobody's going to enforce against hobbyists. I mean, maybe there's a one to five percent chance, but my guess is that they don't really care about hobbyists. They care about the large companies that are profiting substantially off the data, because that's where the highest risk often is. Again, there are a couple of exceptions to that, such as if you're creating really high-risk types of models or systems. But in most cases, it's really not gonna matter a ton.
In the short term, they won't care. In the long term, if you download, you know, copyrighted movies and things of that sort, they can find you through your ISP. Right? So in the short term, it won't matter. But look, if you're dealing with that person who took my lawnmower away, you're still interacting with a stolen good if you ask to borrow that lawnmower that was originally attributed to me.
And sooner or later, once they finish getting their gravy train from the large-scale AI companies and the mid-market ones, they'll have a process in place for saying, hey, Hugging Face, show me all the folks that downloaded this particular model. I'll do a cease and desist, get them to stop using it. If they refuse, or I see them putting it into some sort of production use, then, yes, they'll come after you. Not in the short term, but in the long term, it's your job to defend your works.
It's an interesting debate, and, you know, it kind of takes me back to, I don't know, is this gonna be the era where torrenting comes back, but the LLM version for hobbyists? Maybe BitTorrent is gonna see a rival in the AI and LLM build-out space. Who knows? But I feel like we are in that type of conversation around enforcement and around setting these guidelines. I don't think we're gonna be able to land at a final solution here anytime soon, but I do think this case is going to be, like everyone has basically said on the roundtable today, the tip of the iceberg. And how this one settles, or doesn't, or goes to court, or who knows, is going to start to give signals to the market on how it should respond, how it should shore up its defenses, and potentially provide some guidelines for a legal framework as well down the line. So on that note, we'll go ahead and wrap things up. Thank you, everyone, for your perspectives today, for your commentary. This has been a really powerful debate and discussion. I want to go ahead and thank our panelists again. We'll go down the line. Thank you again to Igor Jablokov, CEO and founder of Pryon. We were also joined by Tatiana Rice, senior counsel for the Future of Privacy Forum; Mark Beccue, AI research director at the Futurum Group; and Lauren Maffeo, author of Designing Data Governance from the Ground Up.
Thanks to the four of y'all. This has been such a treat, and I can't wait to continue this conversation, because I think we're gonna have to have some touch points on this as the year continues. So we'll definitely be in touch. Thanks, everyone.
Thank you. Thanks.
And thank you, everyone, for tuning in to today's episode of Experts Talk. If you enjoyed what you saw here today, make sure you head to our website, MarketScale.com, for previous episodes of our roundtable debates and discussions, and make sure that you tap into commentary for the rest of the week. We're gonna be going live on MarketScale.com and LinkedIn with more Experts Talk roundtables, so check out our calendar for what's coming up next. Also, if you post some questions in chat, we will be sending those off to our experts here after the fact, so we might still see your content published on MarketScale.com. If you had a question, you still got a few minutes, go ahead and send it in. We'll be sending it off to our experts for their perspectives, which will, again, go live on MarketScale.com. So join us again tomorrow, live at 10 a.m. Central, with my colleague, Mr. Ben Thomas. He's our ProAV lead here at MarketScale, and he's gonna be going live with a ProAV roundtable on what the hottest markets are going to be for the AV industry in 2024. Which industry verticals have the most opportunity for growth, and how can AV players, whether they're integrators, solutions designers, etcetera, make the most of that energy for not only the expansion of the industry, but obviously some dollars? So make sure you tune in tomorrow, live at 10 a.m. Central, for that discussion. Until then, I'm Daniel Litwin, the voice of B2B. We'll be back live tomorrow with another Experts Roundtable here on Experts Talk.
About the author
Daniel Litwin is a journalist of multiple disciplines focused on finding and telling engaging stories for B2B communities. He has interviewed executives from Fortune 500 companies including Honeywell, Microsoft, John Deere, and Chipotle, and leads editorial direction at MarketScale. Litwin hosts weekly shows and podcasts while helping develop new content approaches across the MarketScale platform. He holds a B.J. in Radio/Television Reporting/Anchoring and a B.A. in Spanish from the University of Missouri-Columbia.