Duplicate content refers to identical or very similar text that can be found at several different URLs on the Internet. It can occur within a single domain (internal duplicate content) as well as across different domains (external duplicate content), and it poses a technical challenge for search engines.
What is duplicate content?
In essence, duplicate content exists when the same or almost identical content can be accessed via more than one web address (URL). To an Internet user this may seem insignificant, since the content remains the same. From the perspective of a search engine such as Google, however, each URL is a separate entity. If content appears under more than one of these entities, the search engine must decide which version is the relevant one to index and rank. This can lead to problems when assigning relevance and authority.
Common causes for the creation of duplicate content are:
- URL parameters that lead to an identical page view (e.g. session IDs, tracking parameters).
- Accessibility of a page via HTTP and HTTPS as well as with and without www (e.g. http://example.com, https://example.com, http://www.example.com, https://www.example.com).
- Missing or inconsistent use of trailing slashes (example.com/page/ vs. example.com/page).
- Print versions of pages that can be accessed under a separate URL.
- Categories or tags in content management systems (CMS) that display the same content under different URLs.
- Automatic scraping or the unintentional publication of third-party content.
- Content syndication, in which articles are published on partner sites.
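Several of the causes above boil down to many URL variants pointing at one and the same page. The following sketch illustrates this with a simple URL normalization function; the specific rules (force HTTPS, strip "www.", drop query parameters, remove trailing slashes) are illustrative assumptions, not a universal standard.

```python
from urllib.parse import urlsplit, urlunsplit

def canonicalize(url: str) -> str:
    """Map common duplicate-content URL variants onto one canonical form."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]                      # strip the www. subdomain
    path = parts.path.rstrip("/") or "/"     # drop the trailing slash
    # This sketch drops query strings entirely; a real site would keep
    # parameters that actually change the page content.
    return urlunsplit(("https", host, path, "", ""))

variants = [
    "http://example.com/page/",
    "https://www.example.com/page",
    "http://www.example.com/page?sessionid=123",
]
print({canonicalize(u) for u in variants})
# → {'https://example.com/page'} — all three variants collapse to one URL
```

This is essentially what a search engine has to work out on its own when a site does not signal a preferred URL itself.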
Effects of duplicate content on search engine optimization
Duplicate content can have a negative impact on a website's ranking in search engines. The main problems that result are:
- Dilution of ranking signals: If search engines find the same content at several URLs, the link equity and other ranking signals pointing to this content are split between the various duplicates. This weakens the authority of the original or preferred page and can prevent an optimal ranking.
- Inefficient use of the crawl budget: Search engine crawlers have a limited budget for crawling a domain. If a large part of this budget is used up on duplicate content, fewer resources remain for crawling and indexing new or important unique content.
- Difficulties with canonicalization: Search engines try to identify the original source or the preferred version of content (canonicalization). With duplicate content this process can fail, so that an unwanted version is indexed and displayed in the search results.
- Lower visibility: To avoid redundant results, search engines usually show only one version of duplicated content. If the wrong or a less relevant version is selected, the overall visibility of the domain suffers.
Targeted technical measures are necessary to avoid the negative consequences of duplicate content. These include rel="canonical" tags, which signal the preferred URL to search engines, and 301 redirects, which send old or duplicate URLs to the desired target URL. Correct configuration of the CMS and parameter handling in tools such as the Google Search Console are also essential.
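The two measures can be combined: duplicate URL variants receive a 301 redirect, and the surviving page declares itself as canonical. The sketch below assumes https://example.com without www is the preferred origin; the function and header names are illustrative, not part of any specific framework.

```python
CANONICAL_ORIGIN = "https://example.com"  # assumed preferred scheme + host

def handle_request(scheme: str, host: str, path: str):
    """Return (status, headers) for a request, consolidating duplicates.

    Any request that does not already use the canonical scheme/host, or
    that carries a trailing slash, gets a 301 to the preferred URL.
    Otherwise the page is served along with its rel="canonical" tag.
    """
    clean_path = path.rstrip("/") or "/"
    target = f"{CANONICAL_ORIGIN}{clean_path}"
    if scheme != "https" or host != "example.com" or path != clean_path:
        return 301, {"Location": target}
    # The canonical link element belongs in the page's <head>.
    link_tag = f'<link rel="canonical" href="{target}">'
    return 200, {"X-Canonical-Tag": link_tag}

print(handle_request("http", "www.example.com", "/page/"))
# → (301, {'Location': 'https://example.com/page'})
```

The 301 status matters here: unlike a temporary 302 redirect, it tells search engines to transfer ranking signals permanently to the target URL.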